class: center, middle, inverse, title-slide .title[ # Lecture 4: Data Cleaning ] .author[ ### James Sears*
AFRE 891 SS 24
Michigan State University ] .date[ ### .small[
*Parts of these slides are adapted from
“Advanced Data Analytics”
by Nick Hagerty and
“Data Science for Economists”
by Grant McDermott.] ] --- <style type="text/css"> # CSS for including pauses in printed PDF output (see bottom of lecture) @media print { .has-continuation { display: block !important; } } .remark-code-line { font-size: 95%; } .small { font-size: 75%; } .scroll-output-full { height: 90%; overflow-y: scroll; } .scroll-output-75 { height: 75%; overflow-y: scroll; } </style> # Table of Contents 1. [Prologue](#prologue) 1. [Paths and Importing Data](#import) 1. [Keys and Relational Data](#keys) 1. [String Cleaning](#string) 1. [Number Storage](#numbers) 1. [Data Cleaning Checklist](#check) --- class: inverse, middle name: prologue # Prologue --- # Data Cleaning No that we know how to wrangle data in R, it's time to talk more specifically about .hi-medgrn[Data Cleaning]. * Importing data * Keys and relational data * Cleaning character strings * Number Storage -- Packages we will use today: * .hi-slate[stringr] * .hi-slate[tidyverse] * .hi-slate[nycflights13] ```r pacman::p_load(stringr, tidyverse, nycflights13) ``` --- class: inverse, middle name: import # Paths and Importing Data --- # Paths and Directories The .hi-medgrn[working directory] is your "current location" in the filesystem. What's your current working directory? ```r getwd() ``` ``` ## [1] "F:/OneDrive - Michigan State University/Teaching/MSU 2023-2024/AFRE 891 SS24/Lecture-Slides/04-Cleaning" ``` * This is an example of a full or .hi-slate[absolute path]. * Usually starts with `C:/` on Windows or `/` on Mac. * Defaults to the .hi-medgrn[folder containing your script/Rmd file] --- # Paths and Directories In contrast, .hi-blue[relative paths] are defined .hi-blue[relative to] the full path of the working directory. * Let's say your working directory is `"C:/Github/AFRE-891-SS24/"` * If you had a file saved at `"C:/Github/AFRE-891-SS24/assignment-1/assignment-1.Rmd"` * Its relative path would be `"assignment-1/assignment-1.Rmd"` * Relies on the folder/directory nesting within your filesystem In R you can use either an absolute or relative path in any given situation, but .hi-medgrn[relative paths] are usually .hi-medgrn[easier to work with]. --- # Paths and Directories: Best Practices .hi-medgrn[Option 1: use relative paths within GitHub repositories] * In the main folder (root directory), place * A **README** file that gives a basic overview of your project. * A **master script** that lists and runs all other scripts] * Use paths relative to included folder structure specific to each type of input/output, a la... --- # Paths and Directories: Best Practices ```r /my_project /rawData /processedData /code /1_clean /2_process /3_results /output /tables /figures /estimates ``` Relative path to access a processed data file named `parcels.dta` is then .center[ `"processedData/parcels.dta"` ] --- # Paths and Directories: Best Practices .hi-purple[Option 2: use relative paths within other Version Control] * Even if you're not using GitHub, you can use a similar folder structure for projects saved locally * Back up files/sync across computers with cloud storage (a la OneDrive) --- # Download this data Most data does not come nicely in R packages! You will download it from somewhere and load it in R. [.hi-dkorange[Go here: NYTimes COVID-19 Data]](https://github.com/nytimes/covid-19-data/tree/master/colleges) and click on the .hi-medgrn[Raw CSV] link. * This is a list of COVID-19 case counts reported at U.S. colleges and universities between July 2020 and May 2021. .hi-blue[Save] this file somewhere sensible on your computer * Perhaps the "data" subfolder in your cloned lecture 4 slides folder? --- # Setting the Working Directory You can change your working directory with `setwd(dir)`. Now: 1. .hi-medgrn[Find the location you saved your CSV file and copy the filepath.] * Manual navigation or use the .hi-slate[Files] window in bottom-right .pull-left[ <img src = "images/nav_dir1.png" width = "350" /img> ] .pull-right[ <img src = "images/nav_dir2.png" width = "350" /img> ] --- # Setting the Working Directory You can change your working directory with `setwd(dir)`. Now: 1. .hi-medgrn[Find the location you saved your CSV file and copy the filepath.] 2. .hi-blue[Use the console in R to set your working directory to that location.] For example: ```r setwd("F:/OneDrive - Michigan State University/Github/covid_ex/") ``` --- # setwd Best Practices .hi-purple[Pro tip: minimize use of `setwd()` in scripts.] * Someone else's working directory will be different from yours! * You want your code to be portable. * .hi-blue[Best:] set working directory to the project folder .hi-blue[outside the script] (like we just did) * .hi-medgrn[Better:] declare a `maindir` path to your project folder at the start of your script, set working directory to that path * .hi-slate[Worst:] changing working directories more than once in a script (barf) --- # Read in Data (with readr) The main reason we bother with working directories is to let us .hi-medgrn[read in and interact with data.] -- The tidyverse package `readr` provides lots of options to read data into R. Read in the college COVID data using `read_csv` and the .hi-purple[relative filepath]: ```r col <- read_csv("data/colleges.csv") ``` `View` this data to take a look at it. ```r str(col) ``` ``` ## spc_tbl_ [1,948 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame) ## $ date : Date[1:1948], format: "2021-05-26" "2021-05-26" ... ## $ state : chr [1:1948] "Alabama" "Alabama" "Alabama" "Alabama" ... ## $ county : chr [1:1948] "Madison" "Montgomery" "Limestone" "Lee" ... ## $ city : chr [1:1948] "Huntsville" "Montgomery" "Athens" "Auburn" ... ## $ ipeds_id : chr [1:1948] "100654" "100724" "100812" "100858" ... ## $ college : chr [1:1948] "Alabama A&M University" "Alabama State University" "Athens State University" "Auburn University" ... ## $ cases : num [1:1948] 41 2 45 2742 220 ... ## $ cases_2021: num [1:1948] NA NA 10 567 80 NA 49 53 10 35 ... ## $ notes : chr [1:1948] NA NA NA NA ... ## - attr(*, "spec")= ## .. cols( ## .. date = col_date(format = ""), ## .. state = col_character(), ## .. county = col_character(), ## .. city = col_character(), ## .. ipeds_id = col_character(), ## .. college = col_character(), ## .. cases = col_double(), ## .. cases_2021 = col_double(), ## .. notes = col_character() ## .. ) ## - attr(*, "problems")=<externalptr> ``` --- # Reading in Data with readr .hi-slate[readr] can read a wide set of .hi-medgrn[plain text and delimited] files, as well as .hi-purple[R Data files] | File Type | Function | Use| |----|-----|-----| | CSV | `read_csv` | comma delimited<sup>1</sup> | | CSV | `read_csv2` | semicolon delimited, comma decimal mark | | TSV | `read_tsv` | tab delimited | | Delimited Plain Text | `read_delim` | Any delimiter | | Fixed Width Text | `read_fwf` | Fixed width | | R Data | `read_rds` | storage-efficient native R files | .footnote[More on readr options [.hi-dkorange[here.]](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_data-import.pdf)] --- # Reading in Data with readr .hi-slate[readr] will guess at column types, but often it makes sense to .hi-medgrn[manually specify what column types are] * i.e. FIPS codes with leading zeroes as character strings -- Include the `col_types = list()` argument to tell readr what the column types are before reading in: ```r col <- read_csv("data/colleges.csv", col_types = list( date = col_date(), state = col_character(), county = col_character(), city = col_character(), ipeds_id = col_character(), college = col_character(), cases = col_integer(), cases_2021 = col_integer(), notes = col_character() )) ``` --- # Reading in Data with readr We can also use .hi-medgrn[string abbreviations] in the list to simplify: ```r col <- read_csv("data/colleges.csv", col_types = list( date = "d", # can mix abbreviations/functions state = col_character(), county = "c", city = "c", ipeds_id = "c", college = "c", cases = "i", cases_2021 = "i", notes = "c" )) ``` --- # Reading in Data with readr Or if we don't want to specify variable names, we can use a .hi-blue[single string of abbreviations]: ```r col <- read_csv("data/colleges.csv", col_types = "dccccciic" ) ``` --- # Reading in Online Data Note that we can read files in .hi-medgrn[directly from the internet] without downloading them first * You .hi-blue[usually shouldn't] - what happens if the file location changes? ```r url <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/colleges/colleges.csv" col <- read_csv(url) ``` --- # Reading in Data: Other Formats Often we need to read in other data types, for which we'll need .hi-slate[other packages] | File Type | Function(s) | Package| |----|-----|-----| | CSV | `fread` | .hi-slate[data.table] (good for large files) | | Excel (.xlsx, .xls) | `read_excel` | .hi-slate[readxl] | | Google Sheets | `read_sheet` | .hi-slate[googlesheets4] | | Stata, SAS, SPSS | `read_dta/read_sas/read_sav` | .hi-slate[haven] | | R Data (.rds) | `readRDS` | .hi-slate[base R]| --- # Challenge .hi-slate[Which state had the least total reported Covid-19 cases at colleges and universities?] * Is it Michigan? --- # Writing Out Data with readr .hi-slate[readr] also makes it easy to write (save) out processed files with the `write_X(object, path)` functions. | File Type | Function | Use| |----|-----|-----| | CSV | `write_csv` | comma delimited<sup>1</sup> | | CSV | `write_csv2` | semicolon delimited, comma decimal mark | | TSV | `write_tsv` | tab delimited | | Delimited Plain Text | `write_delim` | Any delimiter | | Fixed Width Text | `write_fwf` | Fixed width | | R Data | `write_rds` | storage-efficient native R files | --- # Writing Out Data with readr Let's say we want to produce a summary statistics table of .hi-blue[all cases] across .hi-blue[all colleges in a state], sorted .hi-blue[most to least] for the .hi-blue[10 states with highest caseloads] ```r state_tab <- group_by(col, state) %>% summarise(total_cases = sum(cases, na.rm = T)) %>% arrange(desc(total_cases)) %>% ungroup() %>% filter(row_number() <= 10) ``` -- We can write it out as a CSV with `write_csv(object, path)` ```r write_csv(state_tab, "data/state_top10.csv") ``` --- class: inverse, middle name: keys # Keys and Relational Data Images in this section are from [.hi-dkorange[R for Data Science]](https://r4ds.had.co.nz/relational-data.html) by Wickham & Grolemund, used under [.hi-dkorange[CC BY-NC-ND 3.0]](https://creativecommons.org/licenses/by-nc-nd/3.0/us/) and not included under this resource's overall CC license. --- # Relational data More often than not, we'll be working with .hi-medgrn[relational data:] multiple tables of data that have relations to each other .pull-left[ .center[<img src = "data:image/png;base64,#images/relational-nycflights.png" width = "500" />] ] .pull-right[ - `flights` connects to `planes` via a single variable, `tailnum`. - `flights` connects to `airlines` through the `carrier` variable. - `flights` connects to `airports` in two ways: via the `origin` and `dest` variables. - `flights` connects to `weather` via `origin` (the location), and `year`, `month`, `day` and `hour` (the time). ] --- # Keys To join relation data, we need .hi-blue[key variable(s)] that .hi-blue[uniquely identifies an observation]. * In `planes`, the key is `tailnum`. * In `weather`, the key consists of 5 variables: (`year`, `month`, `day`, `hour`, `origin`). --- # Keys There are two types of keys: 1. A .hi-medgrn[primary key] uniquely identifies an observation in .hi-medgrn[its own data frame]. * `planes$tailnum` is a .hi-medgrn[primary key] because it uniquely identifies each plane in the `planes` data frame. 1. A .hi-purple[foreign key] uniquely identifies an observation in .hi-purple[another data frame]. * `flights$tailnum` is a .hi-purple[foreign key] because it appears in the `flights` data frame where it matches each flight to a unique plane. A variable can be .hi-slate[both a primary key *and* a foreign key]. * For example, `origin` is part of the `weather` primary key, and is also a foreign key for the `airports` data frame. --- # Keys The .hi-medgrn[primary key] is the .hi-slate[first thing] you need to know about a new data frame. Once you think you know the primary key, .hi-medgrn[verify it]. Here's one way to do that: ```r planes %>% count(tailnum) %>% filter(n > 1) ``` ``` ## # A tibble: 0 × 2 ## # ℹ 2 variables: tailnum <chr>, n <int> ``` .font80[If `tailnum` is the primary key, we can't have any duplicate values!] --- # Keys You can write a .hi-medgrn[unit test] into your code to make sure uniqueness is true before proceeding: ```r dups_planes <- planes %>% count(tailnum) %>% filter(n > 1) *stopifnot(nrow(dups_planes) == 0) dups_weather <- weather |> # same thing using base R pipe count(year, month, day, hour, origin) |> filter(n > 1) *stopifnot(nrow(dups_weather) == 0) ``` ``` ## Error: nrow(dups_weather) == 0 is not TRUE ``` --- # Surrogate Keys What's the primary key in the `flights` data frame? Take a minute to investigate/verify. -- You might think it would be the date + the carrier + the flight or tail number, but neither of those are unique: ```r flights %>% count(year, month, day, carrier, flight) %>% filter(n > 1) ``` ``` ## # A tibble: 24 × 6 ## year month day carrier flight n ## <int> <int> <int> <chr> <int> <int> ## 1 2013 6 8 WN 2269 2 ## 2 2013 6 15 WN 2269 2 ## 3 2013 6 22 WN 2269 2 ## 4 2013 6 29 WN 2269 2 ## 5 2013 7 6 WN 2269 2 ## 6 2013 7 13 WN 2269 2 ## 7 2013 7 20 WN 2269 2 ## 8 2013 7 27 WN 2269 2 ## 9 2013 8 3 WN 2269 2 ## 10 2013 8 10 WN 2269 2 ## # ℹ 14 more rows ``` --- # Surrogate keys If a data frame lacks a primary key but it is tidy (each row is an observation), it's often useful to add in a .hi-blue[surrogate key]: ```r flights2 = flights %>% arrange(year, month, day, carrier, flight, sched_dep_time) %>% mutate(id = row_number()) %>% relocate(id) %>% head(8) ``` --- # Relations A .hi-medgrn[primary key] and the corresponding .hi-purple[foreign key] in another data frame form a **relation**. In general, relations are .hi-slate[one-to-many]: Each flight has one plane, but each plane has many flights. * Sometimes you'll see a .hi-slate[one-to-one] relation, but you can think of this as a special case of one-to-many. * You can also find .hi-slate[many-to-many] relations, but you can think of these as two one-to-many relations going in each direction. * There's a many-to-many relationship between airlines and airports: each airline flies to many airports; each airport hosts many airlines. --- # Relations **Note on Stata: NEVER USE `merge m:m`. JUST DON'T DO IT.** There is no scenario in which it will give you what you want. This syntax should not exist. If you are tempted, you are probably either confused or looking for `joinby`. --- # Relations `join` does **not** think about whether your key is unique, or what type of relation you have. * Instead, it simply returns all possible combinations of observations in your two dataframes: <img src = "data:image/png;base64,#images/join-one-to-many.png" width = "400" /> <img src = "data:image/png;base64,#images/join-many-to-many.png" width = "400" /> --- # Duplicate Keys What if you join by a key that is not actually unique, when you think it is? You'll get .hi-medgrn[extra rows with incorrect matches]: ```r flights_weather <- flights %>% left_join(weather, by=c("year", "month", "day", "origin")) nrow(flights_weather) ``` ``` ## [1] 8036575 ``` Now you no longer have a dataframe of unique flights. ```r nrow(flights) ``` ``` ## [1] 336776 ``` --- # Best Practice: Joins Here's an example of a good (safe) way to join `flights` and `planes`: .hi-blue[1\. Confirm the primary key in `planes` is unique] ```r # Confirm that tailnum is the primary key (unique ID) of planes dups_planes <- planes %>% count(tailnum) %>% filter(n > 1) stopifnot(nrow(dups_planes) == 0) ``` --- # Best Practice: Joins Here's an example of a good (safe) way to join `flights` and `planes`: .hi-blue[2\. Join, keeping the original join keys from both datasets] ```r # Join, keeping the join keys from both datasets flights_planes <- flights %>% left_join(planes %>% rename(year_built = year), by="tailnum", keep=TRUE) %>% rename(tailnum = tailnum.x, tailnum_planes = tailnum.y) ``` --- # Best Practice: Joins Here's an example of a good (safe) way to join `flights` and `planes`: .hi-blue[3\. Confirm the join was one-to-many] ```r # Confirm the join was 1:many stopifnot(nrow(flights) == nrow(flights_planes)) ``` --- class: inverse, middle name: string # String Cleaning Parts of this section are adapted from [.hi-dkorange[“Introduction to Data Science”]](http://rafalab.dfci.harvard.edu/dsbook/string-processing.html) by Rafael A. Irizarry, used under [.hi-dkorange[CC BY-NC-SA 4.0]](https://creativecommons.org/licenses/by-nc-sa/4.0). --- # String Cleaning Regardless of where we get them from, .hi-medgrn[character strings] often require a lot of cleaning work to get them into our desired formats * .hi-slate[Surveys:] report agricultural yields in a mix of bushels, pounds, hundred weight, tons, etc. * .hi-slate[Admin records:] manual entry of information prone to typos/inconsistencies Whether we want to convert to numeric values/dates or find matching info, we will likely need to do some pre-processing on our strings. -- Let's practice some .hi-purple[key string cleaning steps]. --- # String Cleaning Example Let's load in the raw data output from a web form asking students to report their height in inches: ```r library(dslabs) data(reported_heights) str(reported_heights) ``` ``` ## 'data.frame': 1095 obs. of 3 variables: ## $ time_stamp: chr "2014-09-02 13:40:36" "2014-09-02 13:46:59" "2014-09-02 13:59:20" "2014-09-02 14:51:53" ... ## $ sex : chr "Male" "Male" "Male" "Male" ... ## $ height : chr "75" "70" "68" "74" ... ``` -- Unfortunately `height` is not numeric. Can we coerce it to numeric? ```r heights2 <- reported_heights %>% mutate(height_num = as.numeric(height)) sum(is.na(heights2$height_num)) ``` ``` ## [1] 81 ``` --- # String Cleaning Example Yes, but we .hi-medgrn[lose a lot of information] because there are plenty of .hi-blue[non-numeric entries:] ```r heights_probs <- filter(heights2, is.na(height_num)) View(heights_probs) heights_probs$height ``` ``` ## [1] "5' 4\"" "165cm" "5'7" ## [4] ">9000" "5'7\"" "5'3\"" ## [7] "5 feet and 8.11 inches" "5'11" "5'9''" ## [10] "5'10''" "5,3" "6'" ## [13] "6,8" "5' 10" "Five foot eight inches" ## [16] "5'5\"" "5'2\"" "5,4" ## [19] "5'3" "5'10''" "5'3''" ## [22] "5'7''" "5'12" "2'33" ## [25] "5'11" "5'3\"" "5,8" ## [28] "5'6''" "5'4" "1,70" ## [31] "5'7.5''" "5'7.5''" "5'2\"" ## [34] "5' 7.78\"" "yyy" "5'5" ## [37] "5'8" "5'6" "5 feet 7inches" ## [40] "6*12" "5 .11" "5 11" ## [43] "5'4" "5'8\"" "5'5" ## [46] "5'7" "5'6" "5'11\"" ## [49] "5'7\"" "5'7" "5'8" ## [52] "5' 11\"" "6'1\"" "69\"" ## [55] "5' 7\"" "5'10''" "5'10" ## [58] "5'10" "5ft 9 inches" "5 ft 9 inches" ## [61] "5'2" "5'11" "5'11''" ## [64] "5'8\"" "708,661" "5 feet 6 inches" ## [67] "5'10''" "5'8" "6'3\"" ## [70] "649,606" "728,346" "6 04" ## [73] "5'9" "5'5''" "5'7\"" ## [76] "6'4\"" "5'4" "170 cm" ## [79] "7,283,465" "5'6" "5'6" ``` --- # String Cleaning Workflow Many of these entries have valuable information, so let's try to salvage as much as we can. The general way to proceed is: 1. Identify the most common patterns among the problematic entries. 2. Write an algorithm to correct these. 3. Review results to make sure your algorithm worked correctly. 4. Look at the remaining problematic entries. Tweak your algorithm or add another one. 5. Stop when all useful information is corrected (or when MB < MC). What are the .hi-slate[most common patterns?] -- * Strings of the form `x'y` or `x'y"` where `x` is feet and `y` is inches. * Strings of the form `x ft y inches`, except that "ft" and "inches" are inconsistent. --- # String Cleaning Workflow Many of these entries have valuable information, so let's try to salvage as much as we can. The general way to proceed is: 1. Identify the most common patterns among the problematic entries. 2. Write an algorithm to correct these. 3. Review results to make sure your algorithm worked correctly. 4. Look at the remaining problematic entries. Tweak your algorithm or add another one. 5. Stop when all useful information is corrected (or when MB < MC). .hi-slate[My suggested approach:]] 1. Try to convert everything to the pattern `x y`. 2. `separate` the feet and inches values. 3. Calculate total inches from feet and inches. --- # 1. Replace Punctuation Start by replacing 4 punctuation marks with spaces (note we have to .hi-slate[escape] the " with `\"`): ```r heights2 <- reported_heights %>% mutate(height_clean = str_replace_all(height, "'", " "), height_clean = str_replace_all(height_clean, ",", " "), height_clean = str_replace_all(height_clean, "\"", " "), height_clean = str_replace_all(height_clean, ",", " ")) heights2$height_clean ``` ``` ## [1] "75" "70" ## [3] "68" "74" ## [5] "61" "65" ## [7] "66" "62" ## [9] "66" "67" ## [11] "72" "6" ## [13] "69" "68" ## [15] "69" "66" ## [17] "75" "64" ## [19] "60" "67" ## [21] "66" "5 4 " ## [23] "70" "73" ## [25] "72" "69" ## [27] "69" "72" ## [29] "64" "72" ## [31] "75" "71" ## [33] "67" "66" ## [35] "67" "69" ## [37] "68" "66.75" ## [39] "72" "5.3" ## [41] "69" "68" ## [43] "63" "60" ## [45] "73" "74" ## [47] "74" "66" ## [49] "68" "73" ## [51] "70" "68" ## [53] "73" "70.5" ## [55] "165cm" "71" ## [57] "70" "67" ## [59] "69" "67" ## [61] "69" "73" ## [63] "74" "70" ## [65] "66" "511" ## [67] "72" "65" ## [69] "65" "70" ## [71] "73" "67" ## [73] "72" "68" ## [75] "68" "65" ## [77] "72" "71" ## [79] "65" "72" ## [81] "69" "70" ## [83] "72" "6" ## [85] "62" "65" ## [87] "70" "60" ## [89] "67" "62" ## [91] "71" "63" ## [93] "68" "64.1732" ## [95] "64" "71" ## [97] "68.5" "62" ## [99] "2" "70" ## [101] "72" "71" ## [103] "71" "69.6" ## [105] "65" "70" ## [107] "61" "63" ## [109] "70" "5 7" ## [111] "62" "68" ## [113] "73" "66.5" ## [115] "69" "74" ## [117] "71.5" "76" ## [119] "69" "74" ## [121] "74.5" "69" ## [123] "66" "64" ## [125] "78" ">9000" ## [127] "5 7 " "69" ## [129] "67" "63" ## [131] "74" "62" ## [133] "69" "64" ## [135] "71" "62.5" ## [137] "68" "67" ## [139] "71" "74" ## [141] "75" "65" ## [143] "68" "65" ## [145] "66" "72" ## [147] "73" "71" ## [149] "74" "5 3 " ## [151] "73" "68" ## [153] "77" "70.5" ## [155] "63" "69" ## [157] "69" "68.89" ## [159] "66.5" "64.173" ## [161] "63" "65" ## [163] "64" "63" ## [165] "63" "69" ## [167] "69" "64" ## [169] "62" "70" ## [171] "70" "59" ## [173] "65" "67.7" ## [175] "72" "74" ## [177] "71.7" "70.87" ## [179] "66" "72" ## [181] "74" "69" ## [183] "71" "70" ## [185] "70" "64" ## [187] "5 feet and 8.11 inches" "68" ## [189] "66" "64" ## [191] "67" "65" ## [193] "72" "5.25" ## [195] "70" "64.57" ## [197] "51" "63" ## [199] "70" "68" ## [201] "67" "5 11" ## [203] "69" "63" ## [205] "71" "70" ## [207] "64" "70" ## [209] "68" "66" ## [211] "69" "67" ## [213] "65" "72" ## [215] "72" "70" ## [217] "75" "64" ## [219] "66" "68" ## [221] "69" "73" ## [223] "66" "71" ## [225] "67" "70" ## [227] "67" "61" ## [229] "64" "64" ## [231] "5.5" "72" ## [233] "70" "65" ## [235] "11111" "5 9 " ## [237] "68" "67" ## [239] "59" "59" ## [241] "6" "6.5" ## [243] "72" "150" ## [245] "5 10 " "69" ## [247] "72" "69" ## [249] "68.11" "68.11" ## [251] "68" "103.2" ## [253] "65" "67" ## [255] "67" "68" ## [257] "69" "80" ## [259] "5.8" "68" ## [261] "19" "5" ## [263] "70" "60" ## [265] "60" "64.96" ## [267] "5.6" "175" ## [269] "177" "300" ## [271] "68" "70" ## [273] "66" "66" ## [275] "5 3" "72" ## [277] "68" "6 " ## [279] "72.05" "71" ## [281] "72.5" "70" ## [283] "72" "6" ## [285] "68" "70" ## [287] "5.9" "6 8" ## [289] "68" "66" ## [291] "5 10" "5.5" ## [293] "178" "75" ## [295] "66" "69" ## [297] "53" "71" ## [299] "70" "72" ## [301] "68.5" "68" ## [303] "163" "68" ## [305] "59.0551" "66.92" ## [307] "72" "72" ## [309] "6.2" "175" ## [311] "Five foot eight inches" "70" ## [313] "69" "61" ## [315] "6.2" "65" ## [317] "72" "66" ## [319] "5.8" "5.1" ## [321] "70" "65" ## [323] "68" "178" ## [325] "70" "71" ## [327] "74" "165" ## [329] "71" "74" ## [331] "64" "5.11" ## [333] "66" "5 5 " ## [335] "165" "74" ## [337] "75" "180" ## [339] "68" "5 2 " ## [341] "74.8" "69" ## [343] "68.4" "5.75" ## [345] "67" "65" ## [347] "68" "67" ## [349] "69.3" "169" ## [351] "67" "67" ## [353] "73" "75" ## [355] "70" "71" ## [357] "5 4" "72" ## [359] "70" "72" ## [361] "71.5" "72" ## [363] "7" "68.8976" ## [365] "68" "5.4" ## [367] "69" "70" ## [369] "75" "67" ## [371] "157" "6.1" ## [373] "169" "5 3" ## [375] "5.6" "214" ## [377] "183" "67.72" ## [379] "65" "69" ## [381] "71" "5.6" ## [383] "72" "72.4" ## [385] "69" "6" ## [387] "79.05" "162" ## [389] "178" "180" ## [391] "66.4" "72" ## [393] "5 10 " "66" ## [395] "69.29" "170" ## [397] "5 3 " "66.1416" ## [399] "78" "75" ## [401] "178" "70" ## [403] "66" "65" ## [405] "71" "0.7" ## [407] "68.5" "65" ## [409] "69" "190" ## [411] "70" "66" ## [413] "66" "63" ## [415] "71" "5.4" ## [417] "70" "72" ## [419] "184" "5 7 " ## [421] "68.5" "68" ## [423] "5.9" "70" ## [425] "68.5" "5 12" ## [427] "53.77" "66.92" ## [429] "73" "68" ## [431] "72" "70" ## [433] "5.6" "71.5" ## [435] "5.6" "184" ## [437] "72" "67" ## [439] "6" "167" ## [441] "72" "74" ## [443] "72" "72" ## [445] "67.7165" "71" ## [447] "65" "2 33" ## [449] "72" "68" ## [451] "5 11" "69" ## [453] "5 3 " "71" ## [455] "5.5" "5.2" ## [457] "67" "180" ## [459] "54" "5.5" ## [461] "5.5" "6.5" ## [463] "67" "68.1102" ## [465] "69" "5 8" ## [467] "62" "72" ## [469] "70" "71" ## [471] "64.961" "64" ## [473] "67" "180" ## [475] "66" "183" ## [477] "170" "5 6 " ## [479] "66" "70" ## [481] "66" "64" ## [483] "72" "70" ## [485] "172" "75" ## [487] "72" "72" ## [489] "612" "68.5" ## [491] "66" "5.11" ## [493] "71" "168" ## [495] "72" "5 4" ## [497] "70" "67" ## [499] "70" "70" ## [501] "72" "63" ## [503] "65" "63" ## [505] "1 70" "70" ## [507] "70.8" "172" ## [509] "87" "68" ## [511] "66" "77.1654" ## [513] "5.5" "176" ## [515] "5 7.5 " "5 7.5 " ## [517] "68" "66" ## [519] "74" "72" ## [521] "72.83" "77" ## [523] "111" "67" ## [525] "70" "64" ## [527] "65" "73" ## [529] "5 2 " "66.7" ## [531] "69" "67" ## [533] "173" "69" ## [535] "174" "176" ## [537] "175" "71" ## [539] "64" "67" ## [541] "70" "68" ## [543] "5 7.78 " "68.5" ## [545] "68" "67.71" ## [547] "66.5" "69" ## [549] "6.7" "64" ## [551] "60" "12" ## [553] "6" "5.1" ## [555] "74" "5.6" ## [557] "5.5" "67" ## [559] "68.8" "69" ## [561] "78" "70" ## [563] "66" "66" ## [565] "74" "71" ## [567] "yyy" "73" ## [569] "67" "72" ## [571] "70" "70.1" ## [573] "5.2" "69" ## [575] "5 5" "5 8" ## [577] "72" "72" ## [579] "72" "63" ## [581] "5 6" "67" ## [583] "73.2" "64" ## [585] "74" "73.62" ## [587] "72" "68.4" ## [589] "69" "61" ## [591] "5 feet 7inches" "89" ## [593] "69" "5.6" ## [595] "68" "65" ## [597] "68.9" "67.71" ## [599] "67" "67" ## [601] "59" "59" ## [603] "5.7" "183" ## [605] "71" "172" ## [607] "68" "66.9" ## [609] "34" "68.5" ## [611] "74" "25" ## [613] "76" "60" ## [615] "69" "6" ## [617] "69" "67" ## [619] "70" "68.5" ## [621] "70" "67" ## [623] "70" "69" ## [625] "65" "68.9" ## [627] "5.9" "168" ## [629] "66.9291" "70" ## [631] "6.5" "70.85" ## [633] "170" "175" ## [635] "64" "69" ## [637] "67" "72" ## [639] "6" "22" ## [641] "67" "72" ## [643] "72" "72.83" ## [645] "70" "63" ## [647] "71" "72" ## [649] "72" "72.44" ## [651] "70" "68" ## [653] "59" "5.11" ## [655] "684" "61.32" ## [657] "6" "1" ## [659] "68" "73" ## [661] "71" "69" ## [663] "71" "66.93" ## [665] "1" "58" ## [667] "55" "70" ## [669] "6*12" "66" ## [671] "65" "72" ## [673] "74" "5 .11" ## [675] "69" "87" ## [677] "162" "71" ## [679] "72" "68" ## [681] "69" "70" ## [683] "73" "165" ## [685] "184" "73.22" ## [687] "64" "70" ## [689] "75" "63" ## [691] "72.05" "6" ## [693] "68" "75" ## [695] "68" "173" ## [697] "70" "66.1417" ## [699] "74" "1.6" ## [701] "68" "68" ## [703] "70" "71" ## [705] "172" "170" ## [707] "70" "70.08" ## [709] "67.2" "69" ## [711] "70" "69" ## [713] "72.44" "69" ## [715] "72.45" "69" ## [717] "67" "75.98" ## [719] "71" "67" ## [721] "71" "5.7" ## [723] "5.5" "62" ## [725] "71" "74" ## [727] "69" "72" ## [729] "174" "170" ## [731] "68" "66.9291" ## [733] "63" "160" ## [735] "120" "120" ## [737] "23" "70" ## [739] "192" "68" ## [741] "69" "5 11" ## [743] "71" "69.3" ## [745] "167" "70.866" ## [747] "67" "67.72" ## [749] "70" "67" ## [751] "65" "67" ## [753] "62.4" "71" ## [755] "150" "72" ## [757] "68" "75" ## [759] "72" "66" ## [761] "71" "76" ## [763] "69" "69" ## [765] "1.7" "66" ## [767] "174" "69.29" ## [769] "72" "5.8" ## [771] "64" "66" ## [773] "70" "67" ## [775] "75.6" "71.5" ## [777] "6" "63" ## [779] "62" "69" ## [781] "68" "71.65" ## [783] "75" "68.8976" ## [785] "74" "65" ## [787] "5 4" "5 8 " ## [789] "5 5" "64" ## [791] "67" "69" ## [793] "5.8" "67" ## [795] "68.8976" "62.6" ## [797] "70" "69" ## [799] "65" "63" ## [801] "67.3" "64.2" ## [803] "63" "5.1" ## [805] "67" "5.11" ## [807] "61" "61" ## [809] "68.11" "66.14" ## [811] "67.5" "60" ## [813] "5.7" "67.72" ## [815] "66" "5 7" ## [817] "70" "5 6" ## [819] "64.5" "68" ## [821] "69" "68" ## [823] "5 11 " "76" ## [825] "70.8661" "66" ## [827] "67" "71" ## [829] "62.5" "5 7 " ## [831] "5 7" "172" ## [833] "60" "67" ## [835] "5 8" "75" ## [837] "65" "65" ## [839] "64" "62" ## [841] "180" "75" ## [843] "5 11 " "67" ## [845] "72" "5" ## [847] "68" "64" ## [849] "180" "180" ## [851] "6 1 " "75.4" ## [853] "68" "70" ## [855] "5.9" "63" ## [857] "5.2" "73" ## [859] "72" "68" ## [861] "61" "60" ## [863] "5.5" "63" ## [865] "67" "69 " ## [867] "74" "68" ## [869] "70" "73" ## [871] "70" "60" ## [873] "66" "68" ## [875] "5 7 " "65" ## [877] "71" "5 10 " ## [879] "65" "72.8346" ## [881] "5.51" "5 10" ## [883] "70" "65" ## [885] "5 10" "70" ## [887] "67.7" "62" ## [889] "5ft 9 inches" "5 ft 9 inches" ## [891] "5 2" "74" ## [893] "5 11" "70" ## [895] "64" "73" ## [897] "70" "66" ## [899] "66" "5.8" ## [901] "62" "66" ## [903] "64" "67" ## [905] "71" "78" ## [907] "69" "5.7" ## [909] "70" "65" ## [911] "70" "167" ## [913] "71" "68" ## [915] "67" "168" ## [917] "64" "68" ## [919] "60" "64" ## [921] "73.2" "6" ## [923] "6.1" "71" ## [925] "72.44" "67" ## [927] "71" "5 11 " ## [929] "68.89" "68.89" ## [931] "5.69" "178" ## [933] "182" "69" ## [935] "164" "66" ## [937] "68" "70" ## [939] "70" "5 8 " ## [941] "70.1" "185" ## [943] "6" "86" ## [945] "72" "66" ## [947] "66" "69" ## [949] "66" "78.74" ## [951] "5.7" "70" ## [953] "67" "66" ## [955] "70" "708 661" ## [957] "5.25" "72.8346" ## [959] "67.7" "72" ## [961] "69" "5.5" ## [963] "5 feet 6 inches" "5 10 " ## [965] "172" "67" ## [967] "6" "5 8" ## [969] "69" "64.9" ## [971] "160" "6 3 " ## [973] "69" "64.96" ## [975] "649 606" "10000" ## [977] "64.5" "64" ## [979] "65" "70" ## [981] "5.1" "67" ## [983] "152" "1" ## [985] "67.5" "180" ## [987] "70.86" "728 346" ## [989] "175" "70" ## [991] "158" "173" ## [993] "164" "6 04" ## [995] "169" "0" ## [997] "70" "185" ## [999] "168" "5 9" ## [1001] "169" "5 5 " ## [1003] "68" "174" ## [1005] "68" "68.11024" ## [1007] "6.3" "69" ## [1009] "179" "66" ## [1011] "69" "5 7 " ## [1013] "68.89" "73.22" ## [1015] "66" "5.5" ## [1017] "6" "75" ## [1019] "6" "170" ## [1021] "52" "52" ## [1023] "79" "70" ## [1025] "6" "172" ## [1027] "64" "68.11" ## [1029] "67" "158" ## [1031] "100" "75" ## [1033] "75" "81" ## [1035] "69" "68.8976378" ## [1037] "159" "76" ## [1039] "190" "69.6" ## [1041] "5.7" "63" ## [1043] "70" "72.44" ## [1045] "66" "170" ## [1047] "158" "73" ## [1049] "72" "60" ## [1051] "6 4 " "65" ## [1053] "66" "180" ## [1055] "68.9" "5.57" ## [1057] "71" "5 4" ## [1059] "210" "72" ## [1061] "68" "67" ## [1063] "88" "69" ## [1065] "64.2" "6" ## [1067] "63" "162" ## [1069] "66" "170 cm" ## [1071] "74" "72" ## [1073] "5.7" "71" ## [1075] "50" "170" ## [1077] "157" "186" ## [1079] "65" "170" ## [1081] "63" "7 283 465" ## [1083] "74" "67.7165" ## [1085] "5" "5" ## [1087] "34" "69" ## [1089] "69" "50" ## [1091] "69" "69" ## [1093] "161" "5 6" ## [1095] "5 6" ``` --- # 1. Replace Punctuation We can make this more concise with the .hi-medgrn["or" operator]: ```r heights2 <- reported_heights %>% mutate(height_clean = str_replace_all(height, "'|,|\"|,", " ")) heights2$height_clean ``` ``` ## [1] "75" "70" ## [3] "68" "74" ## [5] "61" "65" ## [7] "66" "62" ## [9] "66" "67" ## [11] "72" "6" ## [13] "69" "68" ## [15] "69" "66" ## [17] "75" "64" ## [19] "60" "67" ## [21] "66" "5 4 " ## [23] "70" "73" ## [25] "72" "69" ## [27] "69" "72" ## [29] "64" "72" ## [31] "75" "71" ## [33] "67" "66" ## [35] "67" "69" ## [37] "68" "66.75" ## [39] "72" "5.3" ## [41] "69" "68" ## [43] "63" "60" ## [45] "73" "74" ## [47] "74" "66" ## [49] "68" "73" ## [51] "70" "68" ## [53] "73" "70.5" ## [55] "165cm" "71" ## [57] "70" "67" ## [59] "69" "67" ## [61] "69" "73" ## [63] "74" "70" ## [65] "66" "511" ## [67] "72" "65" ## [69] "65" "70" ## [71] "73" "67" ## [73] "72" "68" ## [75] "68" "65" ## [77] "72" "71" ## [79] "65" "72" ## [81] "69" "70" ## [83] "72" "6" ## [85] "62" "65" ## [87] "70" "60" ## [89] "67" "62" ## [91] "71" "63" ## [93] "68" "64.1732" ## [95] "64" "71" ## [97] "68.5" "62" ## [99] "2" "70" ## [101] "72" "71" ## [103] "71" "69.6" ## [105] "65" "70" ## [107] "61" "63" ## [109] "70" "5 7" ## [111] "62" "68" ## [113] "73" "66.5" ## [115] "69" "74" ## [117] "71.5" "76" ## [119] "69" "74" ## [121] "74.5" "69" ## [123] "66" "64" ## [125] "78" ">9000" ## [127] "5 7 " "69" ## [129] "67" "63" ## [131] "74" "62" ## [133] "69" "64" ## [135] "71" "62.5" ## [137] "68" "67" ## [139] "71" "74" ## [141] "75" "65" ## [143] "68" "65" ## [145] "66" "72" ## [147] "73" "71" ## [149] "74" "5 3 " ## [151] "73" "68" ## [153] "77" "70.5" ## [155] "63" "69" ## [157] "69" "68.89" ## [159] "66.5" "64.173" ## [161] "63" "65" ## [163] "64" "63" ## [165] "63" "69" ## [167] "69" "64" ## [169] "62" "70" ## [171] "70" "59" ## [173] "65" "67.7" ## [175] "72" "74" ## [177] "71.7" "70.87" ## [179] "66" "72" ## [181] "74" "69" ## [183] "71" "70" ## [185] "70" "64" ## [187] "5 feet and 8.11 inches" "68" ## [189] "66" "64" ## [191] "67" "65" ## [193] "72" "5.25" ## [195] "70" "64.57" ## [197] "51" "63" ## [199] "70" "68" ## [201] "67" "5 11" ## [203] "69" "63" ## [205] "71" "70" ## [207] "64" "70" ## [209] "68" "66" ## [211] "69" "67" ## [213] "65" "72" ## [215] "72" "70" ## [217] "75" "64" ## [219] "66" "68" ## [221] "69" "73" ## [223] "66" "71" ## [225] "67" "70" ## [227] "67" "61" ## [229] "64" "64" ## [231] "5.5" "72" ## [233] "70" "65" ## [235] "11111" "5 9 " ## [237] "68" "67" ## [239] "59" "59" ## [241] "6" "6.5" ## [243] "72" "150" ## [245] "5 10 " "69" ## [247] "72" "69" ## [249] "68.11" "68.11" ## [251] "68" "103.2" ## [253] "65" "67" ## [255] "67" "68" ## [257] "69" "80" ## [259] "5.8" "68" ## [261] "19" "5" ## [263] "70" "60" ## [265] "60" "64.96" ## [267] "5.6" "175" ## [269] "177" "300" ## [271] "68" "70" ## [273] "66" "66" ## [275] "5 3" "72" ## [277] "68" "6 " ## [279] "72.05" "71" ## [281] "72.5" "70" ## [283] "72" "6" ## [285] "68" "70" ## [287] "5.9" "6 8" ## [289] "68" "66" ## [291] "5 10" "5.5" ## [293] "178" "75" ## [295] "66" "69" ## [297] "53" "71" ## [299] "70" "72" ## [301] "68.5" "68" ## [303] "163" "68" ## [305] "59.0551" "66.92" ## [307] "72" "72" ## [309] "6.2" "175" ## [311] "Five foot eight inches" "70" ## [313] "69" "61" ## [315] "6.2" "65" ## [317] "72" "66" ## [319] "5.8" "5.1" ## [321] "70" "65" ## [323] "68" "178" ## [325] "70" "71" ## [327] "74" "165" ## [329] "71" "74" ## [331] "64" "5.11" ## [333] "66" "5 5 " ## [335] "165" "74" ## [337] "75" "180" ## [339] "68" "5 2 " ## [341] "74.8" "69" ## [343] "68.4" "5.75" ## [345] "67" "65" ## [347] "68" "67" ## [349] "69.3" "169" ## [351] "67" "67" ## [353] "73" "75" ## [355] "70" "71" ## [357] "5 4" "72" ## [359] "70" "72" ## [361] "71.5" "72" ## [363] "7" "68.8976" ## [365] "68" "5.4" ## [367] "69" "70" ## [369] "75" "67" ## [371] "157" "6.1" ## [373] "169" "5 3" ## [375] "5.6" "214" ## [377] "183" "67.72" ## [379] "65" "69" ## [381] "71" "5.6" ## [383] "72" "72.4" ## [385] "69" "6" ## [387] "79.05" "162" ## [389] "178" "180" ## [391] "66.4" "72" ## [393] "5 10 " "66" ## [395] "69.29" "170" ## [397] "5 3 " "66.1416" ## [399] "78" "75" ## [401] "178" "70" ## [403] "66" "65" ## [405] "71" "0.7" ## [407] "68.5" "65" ## [409] "69" "190" ## [411] "70" "66" ## [413] "66" "63" ## [415] "71" "5.4" ## [417] "70" "72" ## [419] "184" "5 7 " ## [421] "68.5" "68" ## [423] "5.9" "70" ## [425] "68.5" "5 12" ## [427] "53.77" "66.92" ## [429] "73" "68" ## [431] "72" "70" ## [433] "5.6" "71.5" ## [435] "5.6" "184" ## [437] "72" "67" ## [439] "6" "167" ## [441] "72" "74" ## [443] "72" "72" ## [445] "67.7165" "71" ## [447] "65" "2 33" ## [449] "72" "68" ## [451] "5 11" "69" ## [453] "5 3 " "71" ## [455] "5.5" "5.2" ## [457] "67" "180" ## [459] "54" "5.5" ## [461] "5.5" "6.5" ## [463] "67" "68.1102" ## [465] "69" "5 8" ## [467] "62" "72" ## [469] "70" "71" ## [471] "64.961" "64" ## [473] "67" "180" ## [475] "66" "183" ## [477] "170" "5 6 " ## [479] "66" "70" ## [481] "66" "64" ## [483] "72" "70" ## [485] "172" "75" ## [487] "72" "72" ## [489] "612" "68.5" ## [491] "66" "5.11" ## [493] "71" "168" ## [495] "72" "5 4" ## [497] "70" "67" ## [499] "70" "70" ## [501] "72" "63" ## [503] "65" "63" ## [505] "1 70" "70" ## [507] "70.8" "172" ## [509] "87" "68" ## [511] "66" "77.1654" ## [513] "5.5" "176" ## [515] "5 7.5 " "5 7.5 " ## [517] "68" "66" ## [519] "74" "72" ## [521] "72.83" "77" ## [523] "111" "67" ## [525] "70" "64" ## [527] "65" "73" ## [529] "5 2 " "66.7" ## [531] "69" "67" ## [533] "173" "69" ## [535] "174" "176" ## [537] "175" "71" ## [539] "64" "67" ## [541] "70" "68" ## [543] "5 7.78 " "68.5" ## [545] "68" "67.71" ## [547] "66.5" "69" ## [549] "6.7" "64" ## [551] "60" "12" ## [553] "6" "5.1" ## [555] "74" "5.6" ## [557] "5.5" "67" ## [559] "68.8" "69" ## [561] "78" "70" ## [563] "66" "66" ## [565] "74" "71" ## [567] "yyy" "73" ## [569] "67" "72" ## [571] "70" "70.1" ## [573] "5.2" "69" ## [575] "5 5" "5 8" ## [577] "72" "72" ## [579] "72" "63" ## [581] "5 6" "67" ## [583] "73.2" "64" ## [585] "74" "73.62" ## [587] "72" "68.4" ## [589] "69" "61" ## [591] "5 feet 7inches" "89" ## [593] "69" "5.6" ## [595] "68" "65" ## [597] "68.9" "67.71" ## [599] "67" "67" ## [601] "59" "59" ## [603] "5.7" "183" ## [605] "71" "172" ## [607] "68" "66.9" ## [609] "34" "68.5" ## [611] "74" "25" ## [613] "76" "60" ## [615] "69" "6" ## [617] "69" "67" ## [619] "70" "68.5" ## [621] "70" "67" ## [623] "70" "69" ## [625] "65" "68.9" ## [627] "5.9" "168" ## [629] "66.9291" "70" ## [631] "6.5" "70.85" ## [633] "170" "175" ## [635] "64" "69" ## [637] "67" "72" ## [639] "6" "22" ## [641] "67" "72" ## [643] "72" "72.83" ## [645] "70" "63" ## [647] "71" "72" ## [649] "72" "72.44" ## [651] "70" "68" ## [653] "59" "5.11" ## [655] "684" "61.32" ## [657] "6" "1" ## [659] "68" "73" ## [661] "71" "69" ## [663] "71" "66.93" ## [665] "1" "58" ## [667] "55" "70" ## [669] "6*12" "66" ## [671] "65" "72" ## [673] "74" "5 .11" ## [675] "69" "87" ## [677] "162" "71" ## [679] "72" "68" ## [681] "69" "70" ## [683] "73" "165" ## [685] "184" "73.22" ## [687] "64" "70" ## [689] "75" "63" ## [691] "72.05" "6" ## [693] "68" "75" ## [695] "68" "173" ## [697] "70" "66.1417" ## [699] "74" "1.6" ## [701] "68" "68" ## [703] "70" "71" ## [705] "172" "170" ## [707] "70" "70.08" ## [709] "67.2" "69" ## [711] "70" "69" ## [713] "72.44" "69" ## [715] "72.45" "69" ## [717] "67" "75.98" ## [719] "71" "67" ## [721] "71" "5.7" ## [723] "5.5" "62" ## [725] "71" "74" ## [727] "69" "72" ## [729] "174" "170" ## [731] "68" "66.9291" ## [733] "63" "160" ## [735] "120" "120" ## [737] "23" "70" ## [739] "192" "68" ## [741] "69" "5 11" ## [743] "71" "69.3" ## [745] "167" "70.866" ## [747] "67" "67.72" ## [749] "70" "67" ## [751] "65" "67" ## [753] "62.4" "71" ## [755] "150" "72" ## [757] "68" "75" ## [759] "72" "66" ## [761] "71" "76" ## [763] "69" "69" ## [765] "1.7" "66" ## [767] "174" "69.29" ## [769] "72" "5.8" ## [771] "64" "66" ## [773] "70" "67" ## [775] "75.6" "71.5" ## [777] "6" "63" ## [779] "62" "69" ## [781] "68" "71.65" ## [783] "75" "68.8976" ## [785] "74" "65" ## [787] "5 4" "5 8 " ## [789] "5 5" "64" ## [791] "67" "69" ## [793] "5.8" "67" ## [795] "68.8976" "62.6" ## [797] "70" "69" ## [799] "65" "63" ## [801] "67.3" "64.2" ## [803] "63" "5.1" ## [805] "67" "5.11" ## [807] "61" "61" ## [809] "68.11" "66.14" ## [811] "67.5" "60" ## [813] "5.7" "67.72" ## [815] "66" "5 7" ## [817] "70" "5 6" ## [819] "64.5" "68" ## [821] "69" "68" ## [823] "5 11 " "76" ## [825] "70.8661" "66" ## [827] "67" "71" ## [829] "62.5" "5 7 " ## [831] "5 7" "172" ## [833] "60" "67" ## [835] "5 8" "75" ## [837] "65" "65" ## [839] "64" "62" ## [841] "180" "75" ## [843] "5 11 " "67" ## [845] "72" "5" ## [847] "68" "64" ## [849] "180" "180" ## [851] "6 1 " "75.4" ## [853] "68" "70" ## [855] "5.9" "63" ## [857] "5.2" "73" ## [859] "72" "68" ## [861] "61" "60" ## [863] "5.5" "63" ## [865] "67" "69 " ## [867] "74" "68" ## [869] "70" "73" ## [871] "70" "60" ## [873] "66" "68" ## [875] "5 7 " "65" ## [877] "71" "5 10 " ## [879] "65" "72.8346" ## [881] "5.51" "5 10" ## [883] "70" "65" ## [885] "5 10" "70" ## [887] "67.7" "62" ## [889] "5ft 9 inches" "5 ft 9 inches" ## [891] "5 2" "74" ## [893] "5 11" "70" ## [895] "64" "73" ## [897] "70" "66" ## [899] "66" "5.8" ## [901] "62" "66" ## [903] "64" "67" ## [905] "71" "78" ## [907] "69" "5.7" ## [909] "70" "65" ## [911] "70" "167" ## [913] "71" "68" ## [915] "67" "168" ## [917] "64" "68" ## [919] "60" "64" ## [921] "73.2" "6" ## [923] "6.1" "71" ## [925] "72.44" "67" ## [927] "71" "5 11 " ## [929] "68.89" "68.89" ## [931] "5.69" "178" ## [933] "182" "69" ## [935] "164" "66" ## [937] "68" "70" ## [939] "70" "5 8 " ## [941] "70.1" "185" ## [943] "6" "86" ## [945] "72" "66" ## [947] "66" "69" ## [949] "66" "78.74" ## [951] "5.7" "70" ## [953] "67" "66" ## [955] "70" "708 661" ## [957] "5.25" "72.8346" ## [959] "67.7" "72" ## [961] "69" "5.5" ## [963] "5 feet 6 inches" "5 10 " ## [965] "172" "67" ## [967] "6" "5 8" ## [969] "69" "64.9" ## [971] "160" "6 3 " ## [973] "69" "64.96" ## [975] "649 606" "10000" ## [977] "64.5" "64" ## [979] "65" "70" ## [981] "5.1" "67" ## [983] "152" "1" ## [985] "67.5" "180" ## [987] "70.86" "728 346" ## [989] "175" "70" ## [991] "158" "173" ## [993] "164" "6 04" ## [995] "169" "0" ## [997] "70" "185" ## [999] "168" "5 9" ## [1001] "169" "5 5 " ## [1003] "68" "174" ## [1005] "68" "68.11024" ## [1007] "6.3" "69" ## [1009] "179" "66" ## [1011] "69" "5 7 " ## [1013] "68.89" "73.22" ## [1015] "66" "5.5" ## [1017] "6" "75" ## [1019] "6" "170" ## [1021] "52" "52" ## [1023] "79" "70" ## [1025] "6" "172" ## [1027] "64" "68.11" ## [1029] "67" "158" ## [1031] "100" "75" ## [1033] "75" "81" ## [1035] "69" "68.8976378" ## [1037] "159" "76" ## [1039] "190" "69.6" ## [1041] "5.7" "63" ## [1043] "70" "72.44" ## [1045] "66" "170" ## [1047] "158" "73" ## [1049] "72" "60" ## [1051] "6 4 " "65" ## [1053] "66" "180" ## [1055] "68.9" "5.57" ## [1057] "71" "5 4" ## [1059] "210" "72" ## [1061] "68" "67" ## [1063] "88" "69" ## [1065] "64.2" "6" ## [1067] "63" "162" ## [1069] "66" "170 cm" ## [1071] "74" "72" ## [1073] "5.7" "71" ## [1075] "50" "170" ## [1077] "157" "186" ## [1079] "65" "170" ## [1081] "63" "7 283 465" ## [1083] "74" "67.7165" ## [1085] "5" "5" ## [1087] "34" "69" ## [1089] "69" "50" ## [1091] "69" "69" ## [1093] "161" "5 6" ## [1095] "5 6" ``` --- # 2. Remove Common Words + Extra Space Next, get rid of some common words and .hi-blue[trim extra spaces]: ```r heights2 <- reported_heights %>% mutate(height_clean = str_replace_all(height, "'|,|\"|,|ft|feet|inches|and", " "), height_clean = str_trim(height_clean)) # remove whitespace from start-end of string heights2$height_clean ``` ``` ## [1] "75" "70" "68" "74" ## [5] "61" "65" "66" "62" ## [9] "66" "67" "72" "6" ## [13] "69" "68" "69" "66" ## [17] "75" "64" "60" "67" ## [21] "66" "5 4" "70" "73" ## [25] "72" "69" "69" "72" ## [29] "64" "72" "75" "71" ## [33] "67" "66" "67" "69" ## [37] "68" "66.75" "72" "5.3" ## [41] "69" "68" "63" "60" ## [45] "73" "74" "74" "66" ## [49] "68" "73" "70" "68" ## [53] "73" "70.5" "165cm" "71" ## [57] "70" "67" "69" "67" ## [61] "69" "73" "74" "70" ## [65] "66" "511" "72" "65" ## [69] "65" "70" "73" "67" ## [73] "72" "68" "68" "65" ## [77] "72" "71" "65" "72" ## [81] "69" "70" "72" "6" ## [85] "62" "65" "70" "60" ## [89] "67" "62" "71" "63" ## [93] "68" "64.1732" "64" "71" ## [97] "68.5" "62" "2" "70" ## [101] "72" "71" "71" "69.6" ## [105] "65" "70" "61" "63" ## [109] "70" "5 7" "62" "68" ## [113] "73" "66.5" "69" "74" ## [117] "71.5" "76" "69" "74" ## [121] "74.5" "69" "66" "64" ## [125] "78" ">9000" "5 7" "69" ## [129] "67" "63" "74" "62" ## [133] "69" "64" "71" "62.5" ## [137] "68" "67" "71" "74" ## [141] "75" "65" "68" "65" ## [145] "66" "72" "73" "71" ## [149] "74" "5 3" "73" "68" ## [153] "77" "70.5" "63" "69" ## [157] "69" "68.89" "66.5" "64.173" ## [161] "63" "65" "64" "63" ## [165] "63" "69" "69" "64" ## [169] "62" "70" "70" "59" ## [173] "65" "67.7" "72" "74" ## [177] "71.7" "70.87" "66" "72" ## [181] "74" "69" "71" "70" ## [185] "70" "64" "5 8.11" "68" ## [189] "66" "64" "67" "65" ## [193] "72" "5.25" "70" "64.57" ## [197] "51" "63" "70" "68" ## [201] "67" "5 11" "69" "63" ## [205] "71" "70" "64" "70" ## [209] "68" "66" "69" "67" ## [213] "65" "72" "72" "70" ## [217] "75" "64" "66" "68" ## [221] "69" "73" "66" "71" ## [225] "67" "70" "67" "61" ## [229] "64" "64" "5.5" "72" ## [233] "70" "65" "11111" "5 9" ## [237] "68" "67" "59" "59" ## [241] "6" "6.5" "72" "150" ## [245] "5 10" "69" "72" "69" ## [249] "68.11" "68.11" "68" "103.2" ## [253] "65" "67" "67" "68" ## [257] "69" "80" "5.8" "68" ## [261] "19" "5" "70" "60" ## [265] "60" "64.96" "5.6" "175" ## [269] "177" "300" "68" "70" ## [273] "66" "66" "5 3" "72" ## [277] "68" "6" "72.05" "71" ## [281] "72.5" "70" "72" "6" ## [285] "68" "70" "5.9" "6 8" ## [289] "68" "66" "5 10" "5.5" ## [293] "178" "75" "66" "69" ## [297] "53" "71" "70" "72" ## [301] "68.5" "68" "163" "68" ## [305] "59.0551" "66.92" "72" "72" ## [309] "6.2" "175" "Five foot eight" "70" ## [313] "69" "61" "6.2" "65" ## [317] "72" "66" "5.8" "5.1" ## [321] "70" "65" "68" "178" ## [325] "70" "71" "74" "165" ## [329] "71" "74" "64" "5.11" ## [333] "66" "5 5" "165" "74" ## [337] "75" "180" "68" "5 2" ## [341] "74.8" "69" "68.4" "5.75" ## [345] "67" "65" "68" "67" ## [349] "69.3" "169" "67" "67" ## [353] "73" "75" "70" "71" ## [357] "5 4" "72" "70" "72" ## [361] "71.5" "72" "7" "68.8976" ## [365] "68" "5.4" "69" "70" ## [369] "75" "67" "157" "6.1" ## [373] "169" "5 3" "5.6" "214" ## [377] "183" "67.72" "65" "69" ## [381] "71" "5.6" "72" "72.4" ## [385] "69" "6" "79.05" "162" ## [389] "178" "180" "66.4" "72" ## [393] "5 10" "66" "69.29" "170" ## [397] "5 3" "66.1416" "78" "75" ## [401] "178" "70" "66" "65" ## [405] "71" "0.7" "68.5" "65" ## [409] "69" "190" "70" "66" ## [413] "66" "63" "71" "5.4" ## [417] "70" "72" "184" "5 7" ## [421] "68.5" "68" "5.9" "70" ## [425] "68.5" "5 12" "53.77" "66.92" ## [429] "73" "68" "72" "70" ## [433] "5.6" "71.5" "5.6" "184" ## [437] "72" "67" "6" "167" ## [441] "72" "74" "72" "72" ## [445] "67.7165" "71" "65" "2 33" ## [449] "72" "68" "5 11" "69" ## [453] "5 3" "71" "5.5" "5.2" ## [457] "67" "180" "54" "5.5" ## [461] "5.5" "6.5" "67" "68.1102" ## [465] "69" "5 8" "62" "72" ## [469] "70" "71" "64.961" "64" ## [473] "67" "180" "66" "183" ## [477] "170" "5 6" "66" "70" ## [481] "66" "64" "72" "70" ## [485] "172" "75" "72" "72" ## [489] "612" "68.5" "66" "5.11" ## [493] "71" "168" "72" "5 4" ## [497] "70" "67" "70" "70" ## [501] "72" "63" "65" "63" ## [505] "1 70" "70" "70.8" "172" ## [509] "87" "68" "66" "77.1654" ## [513] "5.5" "176" "5 7.5" "5 7.5" ## [517] "68" "66" "74" "72" ## [521] "72.83" "77" "111" "67" ## [525] "70" "64" "65" "73" ## [529] "5 2" "66.7" "69" "67" ## [533] "173" "69" "174" "176" ## [537] "175" "71" "64" "67" ## [541] "70" "68" "5 7.78" "68.5" ## [545] "68" "67.71" "66.5" "69" ## [549] "6.7" "64" "60" "12" ## [553] "6" "5.1" "74" "5.6" ## [557] "5.5" "67" "68.8" "69" ## [561] "78" "70" "66" "66" ## [565] "74" "71" "yyy" "73" ## [569] "67" "72" "70" "70.1" ## [573] "5.2" "69" "5 5" "5 8" ## [577] "72" "72" "72" "63" ## [581] "5 6" "67" "73.2" "64" ## [585] "74" "73.62" "72" "68.4" ## [589] "69" "61" "5 7" "89" ## [593] "69" "5.6" "68" "65" ## [597] "68.9" "67.71" "67" "67" ## [601] "59" "59" "5.7" "183" ## [605] "71" "172" "68" "66.9" ## [609] "34" "68.5" "74" "25" ## [613] "76" "60" "69" "6" ## [617] "69" "67" "70" "68.5" ## [621] "70" "67" "70" "69" ## [625] "65" "68.9" "5.9" "168" ## [629] "66.9291" "70" "6.5" "70.85" ## [633] "170" "175" "64" "69" ## [637] "67" "72" "6" "22" ## [641] "67" "72" "72" "72.83" ## [645] "70" "63" "71" "72" ## [649] "72" "72.44" "70" "68" ## [653] "59" "5.11" "684" "61.32" ## [657] "6" "1" "68" "73" ## [661] "71" "69" "71" "66.93" ## [665] "1" "58" "55" "70" ## [669] "6*12" "66" "65" "72" ## [673] "74" "5 .11" "69" "87" ## [677] "162" "71" "72" "68" ## [681] "69" "70" "73" "165" ## [685] "184" "73.22" "64" "70" ## [689] "75" "63" "72.05" "6" ## [693] "68" "75" "68" "173" ## [697] "70" "66.1417" "74" "1.6" ## [701] "68" "68" "70" "71" ## [705] "172" "170" "70" "70.08" ## [709] "67.2" "69" "70" "69" ## [713] "72.44" "69" "72.45" "69" ## [717] "67" "75.98" "71" "67" ## [721] "71" "5.7" "5.5" "62" ## [725] "71" "74" "69" "72" ## [729] "174" "170" "68" "66.9291" ## [733] "63" "160" "120" "120" ## [737] "23" "70" "192" "68" ## [741] "69" "5 11" "71" "69.3" ## [745] "167" "70.866" "67" "67.72" ## [749] "70" "67" "65" "67" ## [753] "62.4" "71" "150" "72" ## [757] "68" "75" "72" "66" ## [761] "71" "76" "69" "69" ## [765] "1.7" "66" "174" "69.29" ## [769] "72" "5.8" "64" "66" ## [773] "70" "67" "75.6" "71.5" ## [777] "6" "63" "62" "69" ## [781] "68" "71.65" "75" "68.8976" ## [785] "74" "65" "5 4" "5 8" ## [789] "5 5" "64" "67" "69" ## [793] "5.8" "67" "68.8976" "62.6" ## [797] "70" "69" "65" "63" ## [801] "67.3" "64.2" "63" "5.1" ## [805] "67" "5.11" "61" "61" ## [809] "68.11" "66.14" "67.5" "60" ## [813] "5.7" "67.72" "66" "5 7" ## [817] "70" "5 6" "64.5" "68" ## [821] "69" "68" "5 11" "76" ## [825] "70.8661" "66" "67" "71" ## [829] "62.5" "5 7" "5 7" "172" ## [833] "60" "67" "5 8" "75" ## [837] "65" "65" "64" "62" ## [841] "180" "75" "5 11" "67" ## [845] "72" "5" "68" "64" ## [849] "180" "180" "6 1" "75.4" ## [853] "68" "70" "5.9" "63" ## [857] "5.2" "73" "72" "68" ## [861] "61" "60" "5.5" "63" ## [865] "67" "69" "74" "68" ## [869] "70" "73" "70" "60" ## [873] "66" "68" "5 7" "65" ## [877] "71" "5 10" "65" "72.8346" ## [881] "5.51" "5 10" "70" "65" ## [885] "5 10" "70" "67.7" "62" ## [889] "5 9" "5 9" "5 2" "74" ## [893] "5 11" "70" "64" "73" ## [897] "70" "66" "66" "5.8" ## [901] "62" "66" "64" "67" ## [905] "71" "78" "69" "5.7" ## [909] "70" "65" "70" "167" ## [913] "71" "68" "67" "168" ## [917] "64" "68" "60" "64" ## [921] "73.2" "6" "6.1" "71" ## [925] "72.44" "67" "71" "5 11" ## [929] "68.89" "68.89" "5.69" "178" ## [933] "182" "69" "164" "66" ## [937] "68" "70" "70" "5 8" ## [941] "70.1" "185" "6" "86" ## [945] "72" "66" "66" "69" ## [949] "66" "78.74" "5.7" "70" ## [953] "67" "66" "70" "708 661" ## [957] "5.25" "72.8346" "67.7" "72" ## [961] "69" "5.5" "5 6" "5 10" ## [965] "172" "67" "6" "5 8" ## [969] "69" "64.9" "160" "6 3" ## [973] "69" "64.96" "649 606" "10000" ## [977] "64.5" "64" "65" "70" ## [981] "5.1" "67" "152" "1" ## [985] "67.5" "180" "70.86" "728 346" ## [989] "175" "70" "158" "173" ## [993] "164" "6 04" "169" "0" ## [997] "70" "185" "168" "5 9" ## [1001] "169" "5 5" "68" "174" ## [1005] "68" "68.11024" "6.3" "69" ## [1009] "179" "66" "69" "5 7" ## [1013] "68.89" "73.22" "66" "5.5" ## [1017] "6" "75" "6" "170" ## [1021] "52" "52" "79" "70" ## [1025] "6" "172" "64" "68.11" ## [1029] "67" "158" "100" "75" ## [1033] "75" "81" "69" "68.8976378" ## [1037] "159" "76" "190" "69.6" ## [1041] "5.7" "63" "70" "72.44" ## [1045] "66" "170" "158" "73" ## [1049] "72" "60" "6 4" "65" ## [1053] "66" "180" "68.9" "5.57" ## [1057] "71" "5 4" "210" "72" ## [1061] "68" "67" "88" "69" ## [1065] "64.2" "6" "63" "162" ## [1069] "66" "170 cm" "74" "72" ## [1073] "5.7" "71" "50" "170" ## [1077] "157" "186" "65" "170" ## [1081] "63" "7 283 465" "74" "67.7165" ## [1085] "5" "5" "34" "69" ## [1089] "69" "50" "69" "69" ## [1093] "161" "5 6" "5 6" ``` --- # 2. Remove Common Words + Extra Space Also remove extra space .hi-purple[before/after/within strings]: ```r heights2 <- reported_heights %>% mutate(height_clean = str_replace_all(height, "'|,|\"|,|ft|feet|inches|and|cm", " "), height_clean = str_squish(height_clean)) heights2$height_clean ``` ``` ## [1] "75" "70" "68" "74" ## [5] "61" "65" "66" "62" ## [9] "66" "67" "72" "6" ## [13] "69" "68" "69" "66" ## [17] "75" "64" "60" "67" ## [21] "66" "5 4" "70" "73" ## [25] "72" "69" "69" "72" ## [29] "64" "72" "75" "71" ## [33] "67" "66" "67" "69" ## [37] "68" "66.75" "72" "5.3" ## [41] "69" "68" "63" "60" ## [45] "73" "74" "74" "66" ## [49] "68" "73" "70" "68" ## [53] "73" "70.5" "165" "71" ## [57] "70" "67" "69" "67" ## [61] "69" "73" "74" "70" ## [65] "66" "511" "72" "65" ## [69] "65" "70" "73" "67" ## [73] "72" "68" "68" "65" ## [77] "72" "71" "65" "72" ## [81] "69" "70" "72" "6" ## [85] "62" "65" "70" "60" ## [89] "67" "62" "71" "63" ## [93] "68" "64.1732" "64" "71" ## [97] "68.5" "62" "2" "70" ## [101] "72" "71" "71" "69.6" ## [105] "65" "70" "61" "63" ## [109] "70" "5 7" "62" "68" ## [113] "73" "66.5" "69" "74" ## [117] "71.5" "76" "69" "74" ## [121] "74.5" "69" "66" "64" ## [125] "78" ">9000" "5 7" "69" ## [129] "67" "63" "74" "62" ## [133] "69" "64" "71" "62.5" ## [137] "68" "67" "71" "74" ## [141] "75" "65" "68" "65" ## [145] "66" "72" "73" "71" ## [149] "74" "5 3" "73" "68" ## [153] "77" "70.5" "63" "69" ## [157] "69" "68.89" "66.5" "64.173" ## [161] "63" "65" "64" "63" ## [165] "63" "69" "69" "64" ## [169] "62" "70" "70" "59" ## [173] "65" "67.7" "72" "74" ## [177] "71.7" "70.87" "66" "72" ## [181] "74" "69" "71" "70" ## [185] "70" "64" "5 8.11" "68" ## [189] "66" "64" "67" "65" ## [193] "72" "5.25" "70" "64.57" ## [197] "51" "63" "70" "68" ## [201] "67" "5 11" "69" "63" ## [205] "71" "70" "64" "70" ## [209] "68" "66" "69" "67" ## [213] "65" "72" "72" "70" ## [217] "75" "64" "66" "68" ## [221] "69" "73" "66" "71" ## [225] "67" "70" "67" "61" ## [229] "64" "64" "5.5" "72" ## [233] "70" "65" "11111" "5 9" ## [237] "68" "67" "59" "59" ## [241] "6" "6.5" "72" "150" ## [245] "5 10" "69" "72" "69" ## [249] "68.11" "68.11" "68" "103.2" ## [253] "65" "67" "67" "68" ## [257] "69" "80" "5.8" "68" ## [261] "19" "5" "70" "60" ## [265] "60" "64.96" "5.6" "175" ## [269] "177" "300" "68" "70" ## [273] "66" "66" "5 3" "72" ## [277] "68" "6" "72.05" "71" ## [281] "72.5" "70" "72" "6" ## [285] "68" "70" "5.9" "6 8" ## [289] "68" "66" "5 10" "5.5" ## [293] "178" "75" "66" "69" ## [297] "53" "71" "70" "72" ## [301] "68.5" "68" "163" "68" ## [305] "59.0551" "66.92" "72" "72" ## [309] "6.2" "175" "Five foot eight" "70" ## [313] "69" "61" "6.2" "65" ## [317] "72" "66" "5.8" "5.1" ## [321] "70" "65" "68" "178" ## [325] "70" "71" "74" "165" ## [329] "71" "74" "64" "5.11" ## [333] "66" "5 5" "165" "74" ## [337] "75" "180" "68" "5 2" ## [341] "74.8" "69" "68.4" "5.75" ## [345] "67" "65" "68" "67" ## [349] "69.3" "169" "67" "67" ## [353] "73" "75" "70" "71" ## [357] "5 4" "72" "70" "72" ## [361] "71.5" "72" "7" "68.8976" ## [365] "68" "5.4" "69" "70" ## [369] "75" "67" "157" "6.1" ## [373] "169" "5 3" "5.6" "214" ## [377] "183" "67.72" "65" "69" ## [381] "71" "5.6" "72" "72.4" ## [385] "69" "6" "79.05" "162" ## [389] "178" "180" "66.4" "72" ## [393] "5 10" "66" "69.29" "170" ## [397] "5 3" "66.1416" "78" "75" ## [401] "178" "70" "66" "65" ## [405] "71" "0.7" "68.5" "65" ## [409] "69" "190" "70" "66" ## [413] "66" "63" "71" "5.4" ## [417] "70" "72" "184" "5 7" ## [421] "68.5" "68" "5.9" "70" ## [425] "68.5" "5 12" "53.77" "66.92" ## [429] "73" "68" "72" "70" ## [433] "5.6" "71.5" "5.6" "184" ## [437] "72" "67" "6" "167" ## [441] "72" "74" "72" "72" ## [445] "67.7165" "71" "65" "2 33" ## [449] "72" "68" "5 11" "69" ## [453] "5 3" "71" "5.5" "5.2" ## [457] "67" "180" "54" "5.5" ## [461] "5.5" "6.5" "67" "68.1102" ## [465] "69" "5 8" "62" "72" ## [469] "70" "71" "64.961" "64" ## [473] "67" "180" "66" "183" ## [477] "170" "5 6" "66" "70" ## [481] "66" "64" "72" "70" ## [485] "172" "75" "72" "72" ## [489] "612" "68.5" "66" "5.11" ## [493] "71" "168" "72" "5 4" ## [497] "70" "67" "70" "70" ## [501] "72" "63" "65" "63" ## [505] "1 70" "70" "70.8" "172" ## [509] "87" "68" "66" "77.1654" ## [513] "5.5" "176" "5 7.5" "5 7.5" ## [517] "68" "66" "74" "72" ## [521] "72.83" "77" "111" "67" ## [525] "70" "64" "65" "73" ## [529] "5 2" "66.7" "69" "67" ## [533] "173" "69" "174" "176" ## [537] "175" "71" "64" "67" ## [541] "70" "68" "5 7.78" "68.5" ## [545] "68" "67.71" "66.5" "69" ## [549] "6.7" "64" "60" "12" ## [553] "6" "5.1" "74" "5.6" ## [557] "5.5" "67" "68.8" "69" ## [561] "78" "70" "66" "66" ## [565] "74" "71" "yyy" "73" ## [569] "67" "72" "70" "70.1" ## [573] "5.2" "69" "5 5" "5 8" ## [577] "72" "72" "72" "63" ## [581] "5 6" "67" "73.2" "64" ## [585] "74" "73.62" "72" "68.4" ## [589] "69" "61" "5 7" "89" ## [593] "69" "5.6" "68" "65" ## [597] "68.9" "67.71" "67" "67" ## [601] "59" "59" "5.7" "183" ## [605] "71" "172" "68" "66.9" ## [609] "34" "68.5" "74" "25" ## [613] "76" "60" "69" "6" ## [617] "69" "67" "70" "68.5" ## [621] "70" "67" "70" "69" ## [625] "65" "68.9" "5.9" "168" ## [629] "66.9291" "70" "6.5" "70.85" ## [633] "170" "175" "64" "69" ## [637] "67" "72" "6" "22" ## [641] "67" "72" "72" "72.83" ## [645] "70" "63" "71" "72" ## [649] "72" "72.44" "70" "68" ## [653] "59" "5.11" "684" "61.32" ## [657] "6" "1" "68" "73" ## [661] "71" "69" "71" "66.93" ## [665] "1" "58" "55" "70" ## [669] "6*12" "66" "65" "72" ## [673] "74" "5 .11" "69" "87" ## [677] "162" "71" "72" "68" ## [681] "69" "70" "73" "165" ## [685] "184" "73.22" "64" "70" ## [689] "75" "63" "72.05" "6" ## [693] "68" "75" "68" "173" ## [697] "70" "66.1417" "74" "1.6" ## [701] "68" "68" "70" "71" ## [705] "172" "170" "70" "70.08" ## [709] "67.2" "69" "70" "69" ## [713] "72.44" "69" "72.45" "69" ## [717] "67" "75.98" "71" "67" ## [721] "71" "5.7" "5.5" "62" ## [725] "71" "74" "69" "72" ## [729] "174" "170" "68" "66.9291" ## [733] "63" "160" "120" "120" ## [737] "23" "70" "192" "68" ## [741] "69" "5 11" "71" "69.3" ## [745] "167" "70.866" "67" "67.72" ## [749] "70" "67" "65" "67" ## [753] "62.4" "71" "150" "72" ## [757] "68" "75" "72" "66" ## [761] "71" "76" "69" "69" ## [765] "1.7" "66" "174" "69.29" ## [769] "72" "5.8" "64" "66" ## [773] "70" "67" "75.6" "71.5" ## [777] "6" "63" "62" "69" ## [781] "68" "71.65" "75" "68.8976" ## [785] "74" "65" "5 4" "5 8" ## [789] "5 5" "64" "67" "69" ## [793] "5.8" "67" "68.8976" "62.6" ## [797] "70" "69" "65" "63" ## [801] "67.3" "64.2" "63" "5.1" ## [805] "67" "5.11" "61" "61" ## [809] "68.11" "66.14" "67.5" "60" ## [813] "5.7" "67.72" "66" "5 7" ## [817] "70" "5 6" "64.5" "68" ## [821] "69" "68" "5 11" "76" ## [825] "70.8661" "66" "67" "71" ## [829] "62.5" "5 7" "5 7" "172" ## [833] "60" "67" "5 8" "75" ## [837] "65" "65" "64" "62" ## [841] "180" "75" "5 11" "67" ## [845] "72" "5" "68" "64" ## [849] "180" "180" "6 1" "75.4" ## [853] "68" "70" "5.9" "63" ## [857] "5.2" "73" "72" "68" ## [861] "61" "60" "5.5" "63" ## [865] "67" "69" "74" "68" ## [869] "70" "73" "70" "60" ## [873] "66" "68" "5 7" "65" ## [877] "71" "5 10" "65" "72.8346" ## [881] "5.51" "5 10" "70" "65" ## [885] "5 10" "70" "67.7" "62" ## [889] "5 9" "5 9" "5 2" "74" ## [893] "5 11" "70" "64" "73" ## [897] "70" "66" "66" "5.8" ## [901] "62" "66" "64" "67" ## [905] "71" "78" "69" "5.7" ## [909] "70" "65" "70" "167" ## [913] "71" "68" "67" "168" ## [917] "64" "68" "60" "64" ## [921] "73.2" "6" "6.1" "71" ## [925] "72.44" "67" "71" "5 11" ## [929] "68.89" "68.89" "5.69" "178" ## [933] "182" "69" "164" "66" ## [937] "68" "70" "70" "5 8" ## [941] "70.1" "185" "6" "86" ## [945] "72" "66" "66" "69" ## [949] "66" "78.74" "5.7" "70" ## [953] "67" "66" "70" "708 661" ## [957] "5.25" "72.8346" "67.7" "72" ## [961] "69" "5.5" "5 6" "5 10" ## [965] "172" "67" "6" "5 8" ## [969] "69" "64.9" "160" "6 3" ## [973] "69" "64.96" "649 606" "10000" ## [977] "64.5" "64" "65" "70" ## [981] "5.1" "67" "152" "1" ## [985] "67.5" "180" "70.86" "728 346" ## [989] "175" "70" "158" "173" ## [993] "164" "6 04" "169" "0" ## [997] "70" "185" "168" "5 9" ## [1001] "169" "5 5" "68" "174" ## [1005] "68" "68.11024" "6.3" "69" ## [1009] "179" "66" "69" "5 7" ## [1013] "68.89" "73.22" "66" "5.5" ## [1017] "6" "75" "6" "170" ## [1021] "52" "52" "79" "70" ## [1025] "6" "172" "64" "68.11" ## [1029] "67" "158" "100" "75" ## [1033] "75" "81" "69" "68.8976378" ## [1037] "159" "76" "190" "69.6" ## [1041] "5.7" "63" "70" "72.44" ## [1045] "66" "170" "158" "73" ## [1049] "72" "60" "6 4" "65" ## [1053] "66" "180" "68.9" "5.57" ## [1057] "71" "5 4" "210" "72" ## [1061] "68" "67" "88" "69" ## [1065] "64.2" "6" "63" "162" ## [1069] "66" "170" "74" "72" ## [1073] "5.7" "71" "50" "170" ## [1077] "157" "186" "65" "170" ## [1081] "63" "7 283 465" "74" "67.7165" ## [1085] "5" "5" "34" "69" ## [1089] "69" "50" "69" "69" ## [1093] "161" "5 6" "5 6" ``` --- # 3. Remove Punctuation ```r heights2 <- reported_heights %>% mutate(height_clean = str_replace_all(height, "'|,|\"|,|ft|feet|inches|and|cm", " "), height_clean = str_squish(height_clean), * height_clean = str_replace(height_clean, " \\.", " ")) heights2$height_clean ``` ``` ## [1] "75" "70" "68" "74" ## [5] "61" "65" "66" "62" ## [9] "66" "67" "72" "6" ## [13] "69" "68" "69" "66" ## [17] "75" "64" "60" "67" ## [21] "66" "5 4" "70" "73" ## [25] "72" "69" "69" "72" ## [29] "64" "72" "75" "71" ## [33] "67" "66" "67" "69" ## [37] "68" "66.75" "72" "5.3" ## [41] "69" "68" "63" "60" ## [45] "73" "74" "74" "66" ## [49] "68" "73" "70" "68" ## [53] "73" "70.5" "165" "71" ## [57] "70" "67" "69" "67" ## [61] "69" "73" "74" "70" ## [65] "66" "511" "72" "65" ## [69] "65" "70" "73" "67" ## [73] "72" "68" "68" "65" ## [77] "72" "71" "65" "72" ## [81] "69" "70" "72" "6" ## [85] "62" "65" "70" "60" ## [89] "67" "62" "71" "63" ## [93] "68" "64.1732" "64" "71" ## [97] "68.5" "62" "2" "70" ## [101] "72" "71" "71" "69.6" ## [105] "65" "70" "61" "63" ## [109] "70" "5 7" "62" "68" ## [113] "73" "66.5" "69" "74" ## [117] "71.5" "76" "69" "74" ## [121] "74.5" "69" "66" "64" ## [125] "78" ">9000" "5 7" "69" ## [129] "67" "63" "74" "62" ## [133] "69" "64" "71" "62.5" ## [137] "68" "67" "71" "74" ## [141] "75" "65" "68" "65" ## [145] "66" "72" "73" "71" ## [149] "74" "5 3" "73" "68" ## [153] "77" "70.5" "63" "69" ## [157] "69" "68.89" "66.5" "64.173" ## [161] "63" "65" "64" "63" ## [165] "63" "69" "69" "64" ## [169] "62" "70" "70" "59" ## [173] "65" "67.7" "72" "74" ## [177] "71.7" "70.87" "66" "72" ## [181] "74" "69" "71" "70" ## [185] "70" "64" "5 8.11" "68" ## [189] "66" "64" "67" "65" ## [193] "72" "5.25" "70" "64.57" ## [197] "51" "63" "70" "68" ## [201] "67" "5 11" "69" "63" ## [205] "71" "70" "64" "70" ## [209] "68" "66" "69" "67" ## [213] "65" "72" "72" "70" ## [217] "75" "64" "66" "68" ## [221] "69" "73" "66" "71" ## [225] "67" "70" "67" "61" ## [229] "64" "64" "5.5" "72" ## [233] "70" "65" "11111" "5 9" ## [237] "68" "67" "59" "59" ## [241] "6" "6.5" "72" "150" ## [245] "5 10" "69" "72" "69" ## [249] "68.11" "68.11" "68" "103.2" ## [253] "65" "67" "67" "68" ## [257] "69" "80" "5.8" "68" ## [261] "19" "5" "70" "60" ## [265] "60" "64.96" "5.6" "175" ## [269] "177" "300" "68" "70" ## [273] "66" "66" "5 3" "72" ## [277] "68" "6" "72.05" "71" ## [281] "72.5" "70" "72" "6" ## [285] "68" "70" "5.9" "6 8" ## [289] "68" "66" "5 10" "5.5" ## [293] "178" "75" "66" "69" ## [297] "53" "71" "70" "72" ## [301] "68.5" "68" "163" "68" ## [305] "59.0551" "66.92" "72" "72" ## [309] "6.2" "175" "Five foot eight" "70" ## [313] "69" "61" "6.2" "65" ## [317] "72" "66" "5.8" "5.1" ## [321] "70" "65" "68" "178" ## [325] "70" "71" "74" "165" ## [329] "71" "74" "64" "5.11" ## [333] "66" "5 5" "165" "74" ## [337] "75" "180" "68" "5 2" ## [341] "74.8" "69" "68.4" "5.75" ## [345] "67" "65" "68" "67" ## [349] "69.3" "169" "67" "67" ## [353] "73" "75" "70" "71" ## [357] "5 4" "72" "70" "72" ## [361] "71.5" "72" "7" "68.8976" ## [365] "68" "5.4" "69" "70" ## [369] "75" "67" "157" "6.1" ## [373] "169" "5 3" "5.6" "214" ## [377] "183" "67.72" "65" "69" ## [381] "71" "5.6" "72" "72.4" ## [385] "69" "6" "79.05" "162" ## [389] "178" "180" "66.4" "72" ## [393] "5 10" "66" "69.29" "170" ## [397] "5 3" "66.1416" "78" "75" ## [401] "178" "70" "66" "65" ## [405] "71" "0.7" "68.5" "65" ## [409] "69" "190" "70" "66" ## [413] "66" "63" "71" "5.4" ## [417] "70" "72" "184" "5 7" ## [421] "68.5" "68" "5.9" "70" ## [425] "68.5" "5 12" "53.77" "66.92" ## [429] "73" "68" "72" "70" ## [433] "5.6" "71.5" "5.6" "184" ## [437] "72" "67" "6" "167" ## [441] "72" "74" "72" "72" ## [445] "67.7165" "71" "65" "2 33" ## [449] "72" "68" "5 11" "69" ## [453] "5 3" "71" "5.5" "5.2" ## [457] "67" "180" "54" "5.5" ## [461] "5.5" "6.5" "67" "68.1102" ## [465] "69" "5 8" "62" "72" ## [469] "70" "71" "64.961" "64" ## [473] "67" "180" "66" "183" ## [477] "170" "5 6" "66" "70" ## [481] "66" "64" "72" "70" ## [485] "172" "75" "72" "72" ## [489] "612" "68.5" "66" "5.11" ## [493] "71" "168" "72" "5 4" ## [497] "70" "67" "70" "70" ## [501] "72" "63" "65" "63" ## [505] "1 70" "70" "70.8" "172" ## [509] "87" "68" "66" "77.1654" ## [513] "5.5" "176" "5 7.5" "5 7.5" ## [517] "68" "66" "74" "72" ## [521] "72.83" "77" "111" "67" ## [525] "70" "64" "65" "73" ## [529] "5 2" "66.7" "69" "67" ## [533] "173" "69" "174" "176" ## [537] "175" "71" "64" "67" ## [541] "70" "68" "5 7.78" "68.5" ## [545] "68" "67.71" "66.5" "69" ## [549] "6.7" "64" "60" "12" ## [553] "6" "5.1" "74" "5.6" ## [557] "5.5" "67" "68.8" "69" ## [561] "78" "70" "66" "66" ## [565] "74" "71" "yyy" "73" ## [569] "67" "72" "70" "70.1" ## [573] "5.2" "69" "5 5" "5 8" ## [577] "72" "72" "72" "63" ## [581] "5 6" "67" "73.2" "64" ## [585] "74" "73.62" "72" "68.4" ## [589] "69" "61" "5 7" "89" ## [593] "69" "5.6" "68" "65" ## [597] "68.9" "67.71" "67" "67" ## [601] "59" "59" "5.7" "183" ## [605] "71" "172" "68" "66.9" ## [609] "34" "68.5" "74" "25" ## [613] "76" "60" "69" "6" ## [617] "69" "67" "70" "68.5" ## [621] "70" "67" "70" "69" ## [625] "65" "68.9" "5.9" "168" ## [629] "66.9291" "70" "6.5" "70.85" ## [633] "170" "175" "64" "69" ## [637] "67" "72" "6" "22" ## [641] "67" "72" "72" "72.83" ## [645] "70" "63" "71" "72" ## [649] "72" "72.44" "70" "68" ## [653] "59" "5.11" "684" "61.32" ## [657] "6" "1" "68" "73" ## [661] "71" "69" "71" "66.93" ## [665] "1" "58" "55" "70" ## [669] "6*12" "66" "65" "72" ## [673] "74" "5 11" "69" "87" ## [677] "162" "71" "72" "68" ## [681] "69" "70" "73" "165" ## [685] "184" "73.22" "64" "70" ## [689] "75" "63" "72.05" "6" ## [693] "68" "75" "68" "173" ## [697] "70" "66.1417" "74" "1.6" ## [701] "68" "68" "70" "71" ## [705] "172" "170" "70" "70.08" ## [709] "67.2" "69" "70" "69" ## [713] "72.44" "69" "72.45" "69" ## [717] "67" "75.98" "71" "67" ## [721] "71" "5.7" "5.5" "62" ## [725] "71" "74" "69" "72" ## [729] "174" "170" "68" "66.9291" ## [733] "63" "160" "120" "120" ## [737] "23" "70" "192" "68" ## [741] "69" "5 11" "71" "69.3" ## [745] "167" "70.866" "67" "67.72" ## [749] "70" "67" "65" "67" ## [753] "62.4" "71" "150" "72" ## [757] "68" "75" "72" "66" ## [761] "71" "76" "69" "69" ## [765] "1.7" "66" "174" "69.29" ## [769] "72" "5.8" "64" "66" ## [773] "70" "67" "75.6" "71.5" ## [777] "6" "63" "62" "69" ## [781] "68" "71.65" "75" "68.8976" ## [785] "74" "65" "5 4" "5 8" ## [789] "5 5" "64" "67" "69" ## [793] "5.8" "67" "68.8976" "62.6" ## [797] "70" "69" "65" "63" ## [801] "67.3" "64.2" "63" "5.1" ## [805] "67" "5.11" "61" "61" ## [809] "68.11" "66.14" "67.5" "60" ## [813] "5.7" "67.72" "66" "5 7" ## [817] "70" "5 6" "64.5" "68" ## [821] "69" "68" "5 11" "76" ## [825] "70.8661" "66" "67" "71" ## [829] "62.5" "5 7" "5 7" "172" ## [833] "60" "67" "5 8" "75" ## [837] "65" "65" "64" "62" ## [841] "180" "75" "5 11" "67" ## [845] "72" "5" "68" "64" ## [849] "180" "180" "6 1" "75.4" ## [853] "68" "70" "5.9" "63" ## [857] "5.2" "73" "72" "68" ## [861] "61" "60" "5.5" "63" ## [865] "67" "69" "74" "68" ## [869] "70" "73" "70" "60" ## [873] "66" "68" "5 7" "65" ## [877] "71" "5 10" "65" "72.8346" ## [881] "5.51" "5 10" "70" "65" ## [885] "5 10" "70" "67.7" "62" ## [889] "5 9" "5 9" "5 2" "74" ## [893] "5 11" "70" "64" "73" ## [897] "70" "66" "66" "5.8" ## [901] "62" "66" "64" "67" ## [905] "71" "78" "69" "5.7" ## [909] "70" "65" "70" "167" ## [913] "71" "68" "67" "168" ## [917] "64" "68" "60" "64" ## [921] "73.2" "6" "6.1" "71" ## [925] "72.44" "67" "71" "5 11" ## [929] "68.89" "68.89" "5.69" "178" ## [933] "182" "69" "164" "66" ## [937] "68" "70" "70" "5 8" ## [941] "70.1" "185" "6" "86" ## [945] "72" "66" "66" "69" ## [949] "66" "78.74" "5.7" "70" ## [953] "67" "66" "70" "708 661" ## [957] "5.25" "72.8346" "67.7" "72" ## [961] "69" "5.5" "5 6" "5 10" ## [965] "172" "67" "6" "5 8" ## [969] "69" "64.9" "160" "6 3" ## [973] "69" "64.96" "649 606" "10000" ## [977] "64.5" "64" "65" "70" ## [981] "5.1" "67" "152" "1" ## [985] "67.5" "180" "70.86" "728 346" ## [989] "175" "70" "158" "173" ## [993] "164" "6 04" "169" "0" ## [997] "70" "185" "168" "5 9" ## [1001] "169" "5 5" "68" "174" ## [1005] "68" "68.11024" "6.3" "69" ## [1009] "179" "66" "69" "5 7" ## [1013] "68.89" "73.22" "66" "5.5" ## [1017] "6" "75" "6" "170" ## [1021] "52" "52" "79" "70" ## [1025] "6" "172" "64" "68.11" ## [1029] "67" "158" "100" "75" ## [1033] "75" "81" "69" "68.8976378" ## [1037] "159" "76" "190" "69.6" ## [1041] "5.7" "63" "70" "72.44" ## [1045] "66" "170" "158" "73" ## [1049] "72" "60" "6 4" "65" ## [1053] "66" "180" "68.9" "5.57" ## [1057] "71" "5 4" "210" "72" ## [1061] "68" "67" "88" "69" ## [1065] "64.2" "6" "63" "162" ## [1069] "66" "170" "74" "72" ## [1073] "5.7" "71" "50" "170" ## [1077] "157" "186" "65" "170" ## [1081] "63" "7 283 465" "74" "67.7165" ## [1085] "5" "5" "34" "69" ## [1089] "69" "50" "69" "69" ## [1093] "161" "5 6" "5 6" ``` --- # 4. Calculate Total Inches Now separate the cleaned height into feet and inch variables ```r heights2 <- reported_heights %>% mutate(height_clean = str_replace_all(height, "'|,|\"|,|ft|feet|inches|and|cm", " "), height_clean = str_squish(height_clean), height_clean = str_replace(height_clean, " \\.", " ")) %>% separate_wider_delim(height_clean, # variable to split delim = " ", # delimiter to split on names = c("feet", "inches"), # new var names too_few = "align_end", # add just inches if too few too_many = "debug") %>% # add col to diagnose too many arrange(height_clean_ok) ``` --- # 4. Calculate Total Inches What new variables do we have? * `feet/inches`: the spliut variables we requested * `height_clean_ok`: boolean of whether we got the "right" number of pieces (1-2) * `height_clean_pieces`: numeric number of split pieces * `height_clean_remainder`: the extra string pieces when `\(>2\)` --- # 5. Deal with Extra Pieces Look at the top of the data. We can see there are a few values with extra pieces that are erroneous entries: ```r head(heights2, 4) ``` ``` ## # A tibble: 4 × 9 ## time_stamp sex height feet inches height_clean height_clean_ok ## <chr> <chr> <chr> <chr> <chr> <chr> <lgl> ## 1 2014-10-08 19:19:33 Female Five foo… Five foot Five foot e… FALSE ## 2 2017-08-09 12:16:38 Male 7,283,465 7 283 7 283 465 FALSE ## 3 2014-09-02 13:40:36 Male 75 <NA> 75 75 TRUE ## 4 2014-09-02 13:46:59 Male 70 <NA> 70 70 TRUE ## # ℹ 2 more variables: height_clean_pieces <int>, height_clean_remainder <chr> ``` --- # 5. Deal with Extra Pieces Let's remove those (I'm starting a new dataframe to iterate further) * Use an anonymous function (add `~` before function and `.x` for argument) ```r heights3 <- heights2 %>% # replace NA mutate(across(c(feet, inches), ~ifelse(is.na(.x), 0, .x))) %>% # apply ifelse across both feet/inches variables mutate(across(c(feet, inches), ~ifelse(height_clean_pieces ==3, NA, as.numeric(.x)))) %>% # drop variables we no longer need select(-height_clean_pieces, -height_clean_remainder, -height_clean_ok) head(heights3, 2) ``` ``` ## # A tibble: 2 × 6 ## time_stamp sex height feet inches height_clean ## <chr> <chr> <chr> <dbl> <dbl> <chr> ## 1 2014-10-08 19:19:33 Female Five foot eight inches NA NA Five foot eight ## 2 2017-08-09 12:16:38 Male 7,283,465 NA NA 7 283 465 ``` --- # 6. Make Combined Inch Measurement Now add a "clean" combined inch measurement: ```r heights3 <- heights3 %>% mutate(inches_clean = feet * 12 + inches)%>% arrange(inches_clean) ``` If you `View` the data, you'll find: * Many values between 5 and 7 which are clearly in .hi-medgrn[feet] instead of inches. * Many values between 150 and 214 which are clearly in .hi-blue[cm] instead of inches. --- # 7. Fix Units This is a good use case for `case_when`: ```r heights3 <- heights3 %>% mutate(inches_clean = case_when( # First convert values in feet inches_clean >= 5 & inches_clean <= 7 ~ inches_clean*12, # Next convert cm values between(inches_clean, 150, 214) ~ inches_clean / 2.54, # Otherwise, keep same value TRUE ~ inches_clean) ) ``` --- # 8. Check Plausible Range How many values are still outside a plausible range? ```r heights3 %>% mutate(ok = between(inches_clean, 3.5*12, 7.5*12)) %>% count(ok) ``` ``` ## # A tibble: 3 × 2 ## ok n ## <lgl> <int> ## 1 FALSE 29 ## 2 TRUE 1061 ## 3 NA 5 ``` --- # 9. Deal with Implausible Values What should we do with our implausible values? 1. Some of these may still contain interpretable information. .hi-slate[There may be more cleaning to do.] 1. Some of them may not, in which case we probably won't use them for analysis. * Don't discard them yet! We'll come back to extreme values (aka outliers) in a couple of weeks. 1. You'll find there are also a few instances where our cleaned value appears sensible, but the original value does not. * You may need to tweak the algorithm further. --- # LOOK AT THE DISTRIBUTIONS! .hi-medgrn[Pro Tip: always look at distributions of numeric variables!] ```r ggplot(heights3) + geom_histogram(aes(x = inches_clean), binwidth = 6) ``` <img src="data:image/png;base64,#04-Cleaning_files/figure-html/unnamed-chunk-37-1.png" width="68%" style="display: block; margin: auto;" /> --- # Aside: Regular Expressions .hi-megrn[Regular expressions] are code to .hi-medgrn[describe patterns in strings] that are common acros basically all programming languages ```r names <- c("Python", "SPSS", "Stata", "Julia") # Match strings that CONTAIN a lowercase "t" str_view_all(names, "t") ``` ``` ## [1] │ Py<t>hon ## [2] │ SPSS ## [3] │ S<t>a<t>a ## [4] │ Julia ``` --- # Common Regular Expressions Common regular expression operators include .pull-left[ Match strings that .hi-blue[start] with a capital "S": ```r str_view_all(names, "^S") ``` ``` ## [1] │ Python ## [2] │ <S>PSS ## [3] │ <S>tata ## [4] │ Julia ``` ] .pull-right[ Match strings that .hi-medgrn[end] with a lowercase "a": ```r str_view_all(names, "a$") ``` ``` ## [1] │ Python ## [2] │ SPSS ## [3] │ Stat<a> ## [4] │ Juli<a> ``` ] `^` and `$` are called .hi-slate[anchors]. --- # Common Regular Expressions .pull-left[ Match all lowercase vowels: ```r str_view_all(names, "[aeiou]") ``` ``` ## [1] │ Pyth<o>n ## [2] │ SPSS ## [3] │ St<a>t<a> ## [4] │ J<u>l<i><a> ``` ] .pull-right[ Match everything BUT lowercase vowels: ```r str_view_all(names, "[^aeiou]") ``` ``` ## [1] │ <P><y><t><h>o<n> ## [2] │ <S><P><S><S> ## [3] │ <S><t>a<t>a ## [4] │ <J>u<l>ia ``` ] --- # Common Regular Expressions .pull-left[ Use a vertical bar (`|`) for "or": ```r str_view_all(names, "Stata|SPSS") ``` ``` ## [1] │ Python ## [2] │ <SPSS> ## [3] │ <Stata> ## [4] │ Julia ``` ] .pull-right[ And parentheses to clarify: ```r str_view_all(names, "S(tata|PSS)") ``` ``` ## [1] │ Python ## [2] │ <SPSS> ## [3] │ <Stata> ## [4] │ Julia ``` ] --- # Last Remarks on Regular Expressions All kinds of regex cheat sheets and interactive testers are available via a quick Google. Regexps are hard to read and troubleshoot. Try not to get too deep into them -- you can often accomplish the same goal by breaking it up into smaller chunks. > Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems. - Jamie Zawinski --- # Last Remarks on Regular Expressions This is (the start of) a real regular expression that checks whether an email address is valid: `(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)` --- # Useful Functions for Cleaning Data .hi-slate[stringr] functions we've used here: * `str_replace` and `str_replace_all`: Replace parts of strings. * `str_trim` and `str_squish`: Remove extra spaces. * `str_view_all`: Illustrates matches, to help develop regular expressions. Other .hi-medgrn[tidyverse] functions we've used: * `between`: Test whether values fall within a numerical range. * `case_when`: Multiple conditional expressions. --- # Useful Functions for Cleaning Data Other useful .hi-slate[stringr] functions: * `str_sub`: Subset strings by position of characters. * `str_detect`: Test whether a string matches a pattern. * `str_extract` and `str_extract_all`: extract matching portion(s) of a string Other useful .hi-medgrn[tidyverse] functions: * `na_if`: Set a certain value to missing. * `bind_rows`: Append two datasets that have the same variable structure. * `replace_na`: Set missing values to a certain value. --- class: inverse, middle name: numbers # Number Storage --- # Floating Point Problems Simplify this expression: `\(1-\frac{1}{49}*49\)` It's obviously 0. Now ask R: ```r 1 - (1/49)*49 ``` -- ``` ## [1] 1.110223e-16 ``` This is called a .hi-medgrn[floating point] problem. It arises from the way computers store numbers. --- # Floating Point Problems R doesn't notice that `\(49/49\)` simplifies to 1. It just follows the order of operations. So the first thing it does is calculate: ```r (1/49) ``` ``` ## [1] 0.02040816 ``` Which is an irrational number. So R rounds it to 53 significant digits before multiplying by 49. --- # Floating Point Problems Most of the time, 53 digits is plenty of precision. But sometimes it creates problems. Note: This explanation is actually too simple. The floating-point issue goes .hi-medgrn[deeper than just irrational numbers]. Here's another example: ```r 1 - 0.9 - 0.1 ``` ``` ## [1] -2.775558e-17 ``` -- <br> In 1996, a floating-point error caused a European Space Agency rocket to [.hi-dkorange[self-destruct 37 seconds after it was launched.]](https://jam.dev/blog/famous-bugs-rocket-launch/) --- # Avoiding Floating Point Errors Pay attention to the data type of your variables. Avoid using logical conditions like `height==180` for numeric variables. * `height` may even read as `180` in the `View` window * But under the hood, it might still be stored as `180.000000000000173...`. What you can do instead: * .hi-medgrn[Best option:] `dplyr::near` compares numbers with a built-in tolerance. * Use `>` and `<` comparisons, or `between(height, 179.9, 180.1)`. * Convert in place: `as.integer(height) == 180` * Or with finer control: `round(height, digits=6) == 180` * If all values are integers, store the variable as an integer in the first place. --- # How to Store a Number? .hi-blue[Numeric] variables are stored in scientific notation. * Use to represent a single value, for which digits decrease in importance from left to right. * Example: My height is `172.962469405113283` cm. .hi-medgrn[Integer] variables lack decimal places. * Saves memory relative to numeric variables. * Stores values exactly, avoiding some floating-point problems. .hi-purple[Character] variables store the full sequence of digits literally. * Use when digits lack quantitative information, and each digit is equally important. * Phone numbers, credit card numbers, etc. * No chance of the right-most digits getting lost or corrupted. --- # More Variable Formats .hi-blue[Dates and times] allow you to easily do math and logic on dates and times. * See tidyverse package `lubridate`. .hi-medgrn[Factors] allow you to store values as numbers, but *display* them as strings. * This is useful for sorting things like month names: "Jan", "Feb", "Mar", "Apr".... * See tidyverse packages `forcats`. --- # Memory Space Memory space quickly becomes a problem when you work with large datasets. * But R does a reasonably good job of handling storage efficiently. Logical variables are smaller than integers, which are smaller than numeric. Does it save memory to store a variable as a factor instead of a string? * This used to be true: factor variables only store the factor labels once. * But no longer: R uses a global string pool - each unique string is only stored once. `pryr::object_size()` will tell you how much memory an object takes up (accounting for shared elements within an object). --- class: inverse, middle name: check # Data Cleaning Checklist --- # Data Cleaning Checklist **Part A.** Get to know your data frame. *** 1. **Convert file formats**, as necessary. 1. **Import data and wrangle into a tidy layout.** 1. **Remove irrelevant, garbage, or empty** columns and rows. 1. **Identify the primary key**, or define a surrogate key. 1. **Resolve duplicates** (remove true duplicates, or redefine the primary key). 1. **Understand the definition, origin, and units** of each variable, and document as necessary. 1. **Rename variables** as necessary, to be succinct and descriptive. --- # Data Cleaning Checklist **Part B.** Check your variables. *** .hi-blue[1\. Understand patterns of missing values.] - Find out why they're missing. - Make sure they are not more widespread than you expect. - Convert other intended designations (i.e., -1 or -999) to NA. - Distinguish between missing values and true zeros. .hi-medgrn[2\. Convert to numeric] when variables are inappropriately stored as strings. Correct typos as necessary. .hi-purple[3\.Convert to date/time] format where appropriate. --- # Data Cleaning Checklist **Part B.** Check your variables. *** .hi-blue[1\. Recode binary variables] as 0/1 as necessary. (Often stored as "Yes"/"No" or 1/2.) .hi-medgrn[2\. Convert to factors] when strings take a limited set of possible values. --- # Data Cleaning Checklist **Part C.** Check the values of your quantitative variables. *** .hi-medgrn[1\. Make units and scales consistent.] Avoid having in the same variable: - Some values in meters and others in feet. - Some values in USD and others in GBP. - Some percentages as 40% and others as 0.4. - Some values as millions and others as billions. .hi-purple[2\. Perform logical checks on quantitative variables:] - Define any range restrictions each variable should satisfy, and check them (graphically too!). - Correct any violations that are indisputable data entry mistakes. - Create a flag variable to mark remaining violations. --- # Data Cleaning Checklist **Part D.** Check the rest of your values. *** .hi-purple[1\. Clean string variables.] Some common operations: - Make entirely uppercase or lowercase - Remove punctuation - Trim spaces (extra, starting, ending) - Ensure order of names is consistent - Remove uninformative words like "the" and "a" - Correct spelling inconsistencies (consider text clustering packages) .hi-medgrn[2\. Literally look at your data] tables every step of the way, to spot issues you haven't thought of, and to make sure you're actually doing what you think you're doing. --- # Data Cleaning Checklist **Part E.** Finish up the cleaning phase. *** .hi-purple[1\. Save your clean data] to disk before further manipulation (merging dataframes, transforming variables, restricting the sample). Think of the whole wrangling/cleaning/analysis pipeline as 2 big phases: - Taking messy data from external sources and making a nice, neat table that you are likely to use for multiple purposes in analysis. - Taking that nice, neat table and doing all kinds of new things with it. .hi-medgrn[2\. Record all steps in a script.] .hi-blue[3\. Never overwrite the original raw data file.] --- # Data Cleaning Tips Whenever possible, make changes to values .hi-medgrn[only by logical conditions] on one or more substantive variables - .hi-slate[*not*] by observation ID or (even worse) row number. You want the changes you make to be rule-based, for 2 reasons: * So that they're .hi-blue[general] -- able to handle upstream changes to the data. * So that they're .hi-pink[principled] -- no one can accuse you of cherry-picking. --- # Table of Contents 1. [Prologue](#prologue) 1. [Paths and Importing Data](#import) 1. [Keys and Relational Data](#keys) 1. [String Cleaning](#string) 1. [Number Storage](#numbers) 1. [Data Cleaning Checklist](#check)